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1  Summary 


The  overall  BioCOMP/BioSPICE  project  aims  at  integrating  a  broad  set  of  “Systems 
Biology”  measures  and  models  ranging  from  functional  genomics  to  simulation  tools. 
Our  Harvard  sub-project  focused  on  computer  aided  design  (e.g.  CAD-PAM)  and  testing 
of  various  synthetic  biology  approaches  integrated  with  improved  functional-genomics 
quantitation  tools  (e.g.  MapQuant)  in  turn  integrated  with  systems-biology  modeling  (e.g. 
Minimization  of  Metabolic  Adjustment,  MOMA).  The  entire  collection  long-term  should 
be  capable  of  cycles  of  iteration  -  CAD-Quant-Model  —  CAD-Quant-Model  . . .  Initial 
designs  focused  on  the  major  biopolymer  synthesis  pathways  (DNA,  RNA,  protein),  in 
vitro.  That  project  was  briefly  expanded  with  supplementary  funding  to  include 
experimental  work  which  was  then  transitioned  out  to  a  Dept,  of  Energy  grant  which 
resulted  in  a  Nature  paper  and  commercial  licensing  (to  CodonDevices).  The  metabolic 
modeling  (MOMA)  has  developed  as  very  useful  in  stand-alone  software  and  as  a  prime 
example  of  the  successes  and  challenges  on  merging  such  software  into  the  BioSPICE 
community  vision.  During  this  project  timeline  we  have  developed  ten  major  applications 
of  these  concepts  and  software  tools: 


a.  A  variety  of  chemical  systems  capable  of  replication  and  evolution,  fed  only  by 
small  molecule  nutrients,  is  now  designable  and  constructible.  This  could  be 
achieved  by  stepwise  integration  of  decades  of  work  on  the  reconstitution  of  DNA, 
RNA  and  protein  syntheses  from  pure  components.  Such  an  in  vitro  cell  project 
(IVCP)  would  initially  define  the  components  sufficient  for  each  subsystem,  allow 
detailed  kinetic  analyses,  and  lead  to  improved  in  vitro  methods  for  synthesis  of 
biopolymers,  therapeutics  and  biosensors.  Completion  would  yield  a  functionally 
and  structurally  understood  self-replicating  biosystem.  Safety  concerns  for  synthetic 
life  will  be  alleviated  by  extreme  dependence  on  elaborate  laboratory  reagents  and 
conditions  for  viability.  The  proposed  minimal  genomes  are  113  kilobase  pairs  long 
and  contain  151  genes. 

b.  Mathematical  models  of  diffusion-constrained  polymerase  chain  reactions  provides 
the  basis  of  high-throughput  nucleic  acid  assays  (licensed  commercially  to 
Agencourt  -  Beckmann  -  Coulter)  and  simple  self-organizing  systems  (as  in  item  1 
above) 

c.  The  integration  of  the  genomic  and  proteomic  measurements  and  system-wide 
analyses  was  dramatically  demonstrated  in  the  first  simultaneous  determination  of 
complete  genomes  and  proteomes  (initially  of  Mycoplasma  mobile). 

d.  A  cross-species  expression  data  mining  tool  is  now  realized,  and  a  first  instance  has 
been  put  in  the  public  domain,  called  yMGV  (yeast  microarray  global  viewer).  The 
comparative  approach  is  applicable  to  filling  of  gaps  in  metabolic  networks  using 
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expression  information.  The  metabolic  expression  placement  (MEP)  method  relies 
on  the  co-expression  to  predict  over  20%  of  all  known  Saccharomyces  cerevisiae 
metabolic  enzyme  encoding  genes  within  the  top  50  out  of  5594  candidates  for  their 
enzymatic  function,  and  70%  of  metabolic  genes  whose  expression  level  has  been 
significantly  perturbed  across  the  conditions  of  the  expression  dataset  used. 

e.  Expression  dynamics  of  a  cellular  metabolic  network  demonstrate  predominance  of 
local  gene  regulation.  Metabolic  genes  display  significant  co-expression  on 
distances  smaller  than  the  average  network  distance,  a  behavior  supported  by  the 
distribution  of  transcription  factor  binding  sites  in  the  metabolic  network  and 
genome  context  associations.  Positive  gene  co-expression  decreases  monotonically 
with  distance  in  the  network,  while  negative  co-expression  is  strongest  at 
intermediate  network  distances.  Basic  topological  motifs  of  the  metabolic  network 
exhibit  statistically  significant  differences  in  co-expression  behavior. 

f  While  we  and  others  have  developed  quantitation  for  full  genome  RNA  since  the 
mid  1990s,  the  protein  equivalent  has  awaited  software  like  MapQuant.  This  is 
Open-Source  Software  and  has  been  exported  to  several  groups  including  the 
largest  genomic s-proteomics  center  in  the  world,  the  Broad  Institute,  Cambridge 
where  it  is  in  routine  use. 

g.  The  important  task  of  going  from  annotated  genomes  to  metabolic  flux  models  and 
kinetic  parameter  fitting  has  been  addressed  by  a  variety  of  software  modules 
ranging  from  MOMA  to  ordinary  differential  equations  (ODE)  to  comparative 
genomic  and  geometric  constraints. 

h.  Accurate  Multiplex  Gene  Synthesis  from  Programmable  DNA  Chips  is  now  on-line. 
The  CAD-PAM  software  for  design  of  oligonucleotides  for  synthesis  on  chips  and 
assembly  into  genomes  is  publicly  available  at: 

http://arep.med.harvard.edu/cadpam.html. 

i.  Potential  Bio-Security  implications  of  some  of  the  above  work  have  been  addressed 
by  a  novel,  inexpensive,  semi-automated  means  for  surveillance  of  the  synthetic 
DNA  supply  stream  from  chemicals,  instruments,  oligos  and  genes. 
http://arep.med.harvard.edu/SBP/Church_Biohazard04c.htm 
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2  Introduction 


The  approach  was  to  develop  a  framework  that  accelerates  the  computational 
construction  of  models  of  genetic  and  metabolic  processes  that  can  be  used  to  design, 
synthesize  and  optimize  replicating  in  vivo  and  in  vitro  systems.  These  can  be  used  for 
bioengineering  applications  that  utilize  biomolecules  as  information  processing,  bio¬ 
sensing,  and  structural  components. 

The  above  approach  assumes  some  predicative  power  of  computational  models  of  genetic 
and  metabolic  processes.  The  design,  synthesis  and  optimization  of  replicating  in  vivo 
and  in  vitro  systems  required  new  technology  for  fabrication  of  synthetic  DNA, 
introduction  into  cells  and  high-throughput  monitoring  of  the  properties  of  the  synthetic 
cells.  This  synthetic  component  empowers  an  important  feedback  loop  to  the 
computational  modeling  components. 

The  major  focus  of  the  Church  Lab  BioSPICE  project  has  been  the  development  of 
multisite  use  case  software  packages.  Harvard  and  Dana  Farber  people  have  led  a  5  site 
use  case  involving  16  investigators  entitled  “From  Annotated  Genomes  to  Metabolic 
Flux  Models”.  This  use  case  starts  with  metabolic  pathway  networks  that  have  been 
loaded  into  the  Biowarehouse  from  SRI,  either  using  the  SRI  tool  called  Pathologic  to 
transform  GenBank  files  into  databases  or  in  the  future  using  a  tool  not  yet  developed  to 
convert  Systems  Biology  Markup  Fanguage  (SBMF)  metabolic  models  into 
Biowarehouse  databases.  A  tool  was  created  that  extracts  metabolic  networks  as  SBMF 
files.  These  or  any  SBMF  files  can  be  prepared  for  flux  analysis  and  elementary  flux 
mode  prediction  using  an  interactive  spreadsheet  based  on  the  BioSpreadsheet  developed 
at  the  University  of  Tennessee.  The  spreadsheet  prepares  annotated  SBMF  that  is  used 
by  the  Metatool  dashboard  analyzer  that  was  created  by  the  Keck  Graduate  Institute  to 
enumerate  the  elementary  flux  modes  of  the  network  and  the  Fluxor  program  developed 
at  Harvard  to  make  flux  predictions  for  the  model  as  well  as  predicting  the  fluxes  for  a 
knockout  mutation  requested  using  the  spreadsheet  using  both  the  traditional  linear 
optimization  approach  and  the  MOMA  approach  that  minimizes  the  changes  from  the 
wild  type  flux  predictions.  The  spreadsheet  is  used  for  a  second  time  to  display  the  flux 
predictions,  and  the  standard  BioSPICE  Table  View  analyzer  is  used  to  display  the 
pathways  associated  with  the  elementary  flux  modes. 

Harvard  has  also  collaborated  with  the  Center  for  the  Development  of  Advanced 
Computing  in  Pune,  India,  to  create  an  SBMF  representation  of  the  E.  coli  JR904  flux 
model.  The  Church  Fab  has  prepared  it  using  the  spreadsheet  and  run  a  Fluxor  analysis 
on  it,  but  this  first  attempt  did  not  yield  realistic  predictions.  The  spreadsheet  is  used  to 
diagnose  whether  this  is  a  problem  with  the  Church  lab  model  representation  or  with  the 
Fluxor  application,  one  expects  to  find  fixes  for  both. 
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The  use  ease  distribution  at  http://arep.med.harvard.edu/moma/biospieefluxor.html  is 
provided  as  a  Linux  tar  file  containing  a  dashboard  analyzer  for  extracting  reaction  lists 
from  the  Biowarehouse  as  SBML  files,  the  spreadsheet  analyzer  that  prepares  an  SBML 
file  for  use  by  Fluxor  or  Metatool  and  displays  the  Fluxor  predictions,  and  the  Fluxor 
analyzer  itself.  It  also  contains  the  current  state  of  the  E  coli  JR904  SBML  model,  the 
tab-delimited  listing  of  the  model,  and  a  tool  that  creates  a  new  SBML  file  once  the  tab- 
delimited  listing  has  been  modified.  The  Biowarehouse  and  Pathway  tools  including 
Pathologic  software  can  be  downloaded  from  links  at  http://communitv.biospice.org/  or 
http://www.metacvc.org/.  The  tar  file  on  arep.med.harvard.edu  also  includes  instructions 
for  setting  up  a  MySQL  database  to  hold  the  Biowarehouse  Structured  Query  Language 
script  that  loads  it  with  E  coli  pathway  information  derived  from  EcoCyc  and  a  small 
hypothetical  reaction  system  used  as  the  Church  lab  demonstration  case.  Biowarehouse 
can  also  use  an  Oracle  database,  but  the  Church  lab  analyzer  for  extracting  information 
from  it  as  SBME  was  written  using  MySQE  and  there  is  no  expectation  it  will  work 
unchanged  using  Oracle.  The  Keck  Institute  has  also  set  up  a  download  page  at 
http://public.kgi.edu/~spaladug/one.html,  which  includes  windows  installers  for  the 
Software  Biology  Workbench  (SBW)  including  metatool  and  other  agents  and  for 
BioSPICE  dashboard  analyzers  that  launch  each  of  these  tools.  There  are  also  Einux 
downloads  for  an  SBW  broker,  the  Metatool  agent,  and  the  NOM  agent  that  handles 
providing  an  SBML  file  to  Metatool.  It  is  also  necessary  to  copy  the  java  jar  files  for 
metatool  on  the  windows  platform  to  the  Linux  platform  to  use  these  as  dashboard 
analyzers.  Once  these  SBW  elements  have  been  set  up  on  a  Linux  machine  they  can  be 
used  in  the  same  workflows  as  the  Biowarehouse  2SBML  Eluxor  Spreadsheet,  and 
Eluxor  analyzers. 
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From  Annotated  Genomes  to  Metabolic  Flux  Models  2.0 


Project  URL:  http;//arep. med.harvard.edu/darpabiocomp/ 

Quad  Chart:  http://arep.med.harvard.edu/darpabiocomp/Quad04_GC.ppt 

Figure  1.  From  Annotated  Genomes  to  Metabolic  Flux  Models 


Objective 

This  project  explores  connections  between  computational  systems  biology  and  synthetic 
biology  (bio-input/output,  DNA  memory  &  bio-manufacturing  processes).  This  is  done 
using  the  only  class  of  programmable  nanometer  scale  replicators  (i.e.  polymerase- 
ribosome -based).  The  major  challenge  is  integration  with  silicon  computing.  The 
motivations  are  bio-monitoring  of  spatially  patterned  light,  chemicals,  and  toxins. 
Software  is  aimed  at  BioCOMP/BioSPICE  compatibility  and  emphasizes  computational 
tools  for  analyzing  complex  metabolic  networks  and  related  synthetic  biology  goals. 


Approach 

The  approach  is  to  develop  a  framework  that  accelerates  the  computational  construction 
of  models  of  genetic  and  metabolic  processes  that  can  be  used  to  design,  synthesize  and 
optimize  replicating  in  vivo  and  in  vitro  systems.  These  can  be  used  for  bioengineering 
applications  that  utilize  biomolecules  as  information  processing,  biosensing,  and 
structural  components. 
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The  Systems  Biology  Markup  Language  (SBML)  is  a  computer-readable  format  for 
representing  models  of  biochemical  reaction  networks.  SBML  is  applicable  to  metabolic 
networks,  cell-signaling  pathways,  and  genomic  regulatory  networks.  The  Systems 
Biology  Workbench  (SBW)  is  a  modular,  broker-based,  message-passing  framework  for 
simplified  communication  between  applications  that  aid  in  the  above.  The  SBML  module 
included  the  Network  Object  Model  (NOM)  and  MetaToolSBW,  a  network  analysis  tool. 
These  help  to  develop  a  pipeline  from  genome  sequence  and  annotation  to  metabolic  and 
genetic  network  optimization  models.  This  employs  linear  and  quadratic  programming 
math  modules.  This  is  done  in  the  context  of  experimental  validation  with  an  emphasis 
on  integrating  quantitative  mass  spectrometry,  RNA  tags,  and  effects  of  metabolic 
inhibition  and  related  stresses. 

In  order  to  improve  the  performance  of  the  fabrication  and  memory  tools,  in  vitro 
replication/translation  arrays  will  provide  for  experimental  feedback.  A  120kbp 
minigenome  design  shows  capability  for  replication  and  protein-synthesis.  This 
minigenome  will  be  6  times  smaller  than  the  smallest  living  cellular  genomes  with  1000- 
fold  fewer  molecular  components.  These  in  vitro  systems  are  ideal  for  integrating  with 
detailed  computational  models,  due  to  simplicity,  knowledge  of  the  3D  structure  of 
nearly  all  components  and  extreme  experimental  accessibility.  Also  coupling  the 
extremes  of  modeling  (from  single  base  changes  to  3D  structures  to  molecular  networks 
to  population  doubling  selection)  is  likely  to  be  dramatically  more  transparent  and 
tractable. 

Novel,  Useful  Applications  &  technology  transfer:  The  focus  is  on  practical  applications 
that  take  advantage  of  the  unique  features  of  DNA  and  metabolic  systems.  Examples  are: 
(a)  proven  Myr  information  archiving  and  retrieval;  (b)  interfacing  with  biochemical, 
photon,  or  thermal  sensors,  (c)  A  DNA  recorder  analogous  to  black-box  flight  recorder 
would  take  early  advantage  of  the  ability  to  record  on  DNA  more  easily  than  reading  it. 
Only  rarely  would  the  archived  materials  be  accessed. 

In  vitro  minigenome  synthesis: 

The  Church  group  led  by  Dr.  Tian  has  shown  that  the  multi-his-tag  Western  blots  are  a 
reliable  assay.  Hence,  the  minigenome  genes  have  been  moved  into  his-tagged  "in  vitro" 
linear-vectors.  This  included  synthesis  of  tagged  and  untagged  forms  for  all  23  genes  of 
the  30S-ribosomal  subunit.  From  these  the  Church  group  has  synthesized  all  of  the 
RNAs  in  vitro  and  most  of  the  proteins.  The  low  levels  of  protein  synthesis  observed  for 
a  few  of  these  normally  very  abundant  proteins  is  rapidly  revealing  key  design  criteria  for 
codon  usage  and  secondary  structure  of  the  mRNAs.  The  group  has  developed  software 
for  general  gene  and  genome  design  tools  that  takes  these  observations  into  account. 

A  new  method  was  developed  for  large  scale  synthesis  of  genes  or  genomes  which  has 
the  potential  of  being  100-fold  less  expensive.  This  has  been  successfully  tested  by 
synthesizing  two  full-length  genes  from  the  minigenome  (rs3  &  rs5)  from  a  mixture  of 
512  chemically  synthesized  VOmers.  A  report  of  invention  has  been  filed  with  Harvard 
Medical  School-Office  of  Technology  Licensing.  This  is  a  major  milestone  for  this 
project  supplement.  Joined  by  Hui  Gong  (for  the  oligo  design),  Nijing  Sheng  (gene 
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assembly  expert,  Researeh  Asst.  Prof,  from  the  Univ.  of  Houston)  and  a  Harvard 
undergraduate,  and  CS-graduate  student.  So  this  projeet  is  likely  to  continue  to  progress 
capturing  this  recent  momentum. 

Computational: 

Church  lab  work  on  close-to-optimal  networks  as  might  occur  in  mutants,  "Minimization 
of  Metabolic  Adjustment"  (MoMA)  has  been  extended  to  allow  automated  access  to  new 
genomes  (Daniel  Segre  &  Dennis  Vitkup  assisted  by  Jeremy  Zucker,  Tamar  Mentzel, 
and  Jeremy  Katz)  requiring  only  Kyoto  Encyclopedia  of  Genes  and  Genomes  or 
Genbank  annotations  as  staring  points. 

Table  1.  SBML  models  from  BioCyc  version  7.5  for  14  organisms 

Agrobacterium- tumefaciens  .xml 

Bacillus-subtilis.xml 

Caulobacter-crescentus.xml 

Chlamydia- trachomatis .  xml 

Escherichia-coli.xml 

Haemophilus-influenza.xml 

Helicobacter-pylori.xml 

Mycobacterium-tuberculosis-CDC  1551  .xml 

Mycobacterium-tuberculosis-H37Rv.xml 

Mycoplasma-pneumoniae.xml 

Pseudomonas-aeruginosa.xml 

Saccharomyces-cerevisiae.xml 

Treponema-pallidum.xml 

Vibrio-cholerae.xml 

Plus,  a  program  called  biocyc2sbml.lisp  which  can  take  any  organism  in  a 
Pathway/Genome  database  and  generate  the  corresponding  SBML  model.  The  URL  to 
download  these  models  is  http://genome.dfci.harvard.edu/~zucker/BPHYS/sbml.zip 

An  in  vitro  coupled  replicating  and  translating  system  is  based  on  pure  bacterial  E.coli 
translation.  Novel  developments  include  (1)  a  linear  expression  clone  system  compatible 
with  the  most  powerful  in  vitro  replication  system,  polymerase  chain  reaction  (PCR),  and 
(2)  a  modular  method  for  computer  gene  design  and  automated  gene  synthesis  including 
affinity-tagging  for  all  ribosomal  proteins. 

The  group  merged  Minimization  of  Metabolic  Adjustment  (MOMA)  software  with 
BioSPICE  &  SBML  tools  to  allow  optimization  of  metabolic  network  utilization  in 
mutant  genotypes  and  experimentally  tested  using  metabolic  fluxes  (from  Uwe  Sauer’s 
group)  and  a  new  high-throughput  method  for  measuring  growth  rates  of  hundreds  of 
mutants  in  parallel. 

The  group  developed  methods  for  3D  &  4D  modeling  of  bacterial  cells  and  replication 
translation  of  their  circular  chromosomes.  In  addition  the  Church  lab  has  ID  to  4D 
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models  of  expansion  of  an  in  vitro  DNA  eolony.  High  resolution  atomie-foree 
mieroseopy  images  have  been  obtained  for  heavy-atom  labeled  DNA  samples. 

The  group  eompleted  integration  of  genome  sequenee  for  Mycoplasma  mobile  and  M. 
pneumoniae  with  "complete"  proteome  comparisons.  These  are  proving  crucial  for 
integration  and  4D-modeling  efforts  and  relevant  to  understanding  these  simple 
pathogens  as  biosystems. 


3  Minimal  Cell  Design  Tools 


3.1  Methods,  Assumptions,  and  Procedures 
“How  far  can  we  push  chemical  self-assemhly?" 

This  question  was  posed  recently  as  one  of  the  big  25  questions  in  science  for  the  next  25 
years  (Service,  2005).  Nowadays,  big  questions  often  are  addressed  by  big  experimental 
efforts.  But  before  embarking  on  a  big  project,  it  is  helpful  to  get  specific.  What  push  in 
chemical  self-assembly  might  be  most  worthwhile  and  practical?  Self-assembly  in  vitro 
of  viruses  and  the  ribosome,  achieved  decades  ago,  taught  us  some  of  the  principles 
assumed  to  be  used  in  general  by  cells  (Lewin,  2004).  For  example,  self-assembly  occurs 
in  a  definite  sequence  and  is  generally  energetically  favored,  obviating  the  need  for 
enzymes  and  an  energy  source.  Assembling  some  type  of  cell  would  seem  to  be  the  next 
major  step,  yet  detailed  plans  have  not  been  published.  Here,  we  attempt  to  outline  the 
synthesis  of  a  minimal  cell  containing  the  core  cellular  replication  machinery,  review  the 
pertinent  literature,  and  highlight  gaps  in  knowledge  that  need  filling. 

Utility 

Synthesizing  a  minimal  cell  will  advance  knowledge  of  biological  replication.  Many 
hypotheses  in  replication  and  its  subsystems  can  only  be  tested  in  such  a  synthetic 
biology  project.  The  meaning  of  “synthetic”  (from  Greek  synthesis,  to  put  together) 
discussed  here  bypasses  the  current  reliance  of  synthetic  biology  on  cells  or 
macromolecular  cell  products:  the  aim  is  to  put  together  an  organism  from  small 
molecules  alone.  The  simplest  approach  for  creating  an  artificial  cell  may  be  by  evolving 
an  RNA  polymerase  made  exclusively  of  RNA  (Szostak  et  ah,  2001)  to  replace  all 
protein  components  of  in  vitro  replicating  and  evolving  systems  {e.g.  to  replace  QP 
replicase  (Mills  et  ah,  1967)).  But  in  comparison  with  a  purified  protein-based  system,  it 
is  neither  guaranteed  to  arrive  sooner  nor  tell  us  more.  A  protein-based  system  will 
connect  with,  and  reveal  more  about,  existing  biological  systems.  Life,  like  a  machine, 
cannot  be  understood  simply  by  studying  it  and  its  parts;  it  must  also  be  put  together  from 
its  parts.  Along  the  way  to  synthesizing  a  cell,  we  might  discover  new  biochemical 
functions  essential  for  replication,  unsuspected  macromolecular  modifications,  or 
previously  unrecognized  patterns  of  coordinated  expression. 


How  good  a  model  would  an  artificial,  protein-based,  minimal  cell  be  for  natural  cells? 
The  only  cellular  alternative  is  a  perturbed  natural  cell,  an  incredibly  complex  system 
even  for  the  simplest  of  cells.  A  much  simpler  purified  system  based  on  a  real  cell  would 
thus  be  easier  to  model  and  understand.  It  could  certainly  answer  questions  that  cannot  be 
answered  in  vivo  or  in  crude  extracts,  such  as  which  macromolecules  and 
macromolecular  modifications  are  sufficient  for  subsystem  function.  However,  even  the 
simplest  minimal  cell  would  still  be  highly  complex,  so  its  construction  and  study  would 
be  facilitated  by  substituting  some  of  the  necessary  subsystems  with  simpler  analogs. 
Should  the  simpler  in  vitro  model  turn  out  to  be  a  poor  model  for  the  more  complex  in 
vivo  system,  one  could  always  construct  a  more  complex  in  vitro  system  that  may  better 
reflect  in  vivo. 

Synthesizing  a  cell  will  also  lead  to  new  applications.  Purified  biochemical  systems 
already  offer  major  advantages,  such  as  PCR  and  in  vitro  transcription.  A  better 
understanding  and  manipulation  of  all  cellular  replication  subsystems  (molecular 
biology’s  tool  kit)  should  spin  off  new  technologies.  For  example,  in  vitro  genome 
replication  may  be  useful  for  replicating  very  large  segments  of  DNA  with  high  fidelity. 
Combined  in  vitro  transcription,  RNA  processing  and  RNA  modification  would  allow 
preparation  of  rRNAs  and  tRNAs  with  defined  modifications  to  test  the  roles  of  the 
modifications,  and  modified  tRNAs  to  aid  incorporation  of  unnatural  amino  acids  into 
proteins.  Purified  translation  systems  have  enabled  reassignment  of  mRNA  codons  to 
encode  unnatural  amino  acids  by  omission  of  competing  natural  amino  acids  (Forster  et 
ah,  2003);  further  improvements  of  the  purified  translation  system  could  enable  the 
genetic  selection  of  protease-resistant,  peptide-like  ligands  for  drug  discovery  by  pure 
translation  display  (Forster  et  ah,  2004).  The  purified  translation  system  may  also 
facilitate  expression  of  proteins  difficult  to  express  by  standard  approaches.  Better 
control  of  lipid  vesicle  synthesis  could  advance  liposome-based  drug  delivery.  Since 
bacterial  translation  is  the  main  target  of  antibiotics,  greater  understanding  may  assist 
development  of  new  drugs  to  fight  mounting  antibiotic  resistance.  Ultimate  success  in 
cell  synthesis  could  generate  useful  microorganisms,  e.g.  for  renewable  production  of 
biodegradable  plastics  (Pohorille  and  Deamer,  2002). 


3,1,1  Approach 

The  ideal  approach  for  synthesizing  a  cell  would  allow  all  of  the  machine  parts  to  be 
understood  and  tested.  Like  any  engineering  project,  this  requires  detailed  blueprints,  raw 
synthetic  capabilities  and  an  overall  diagnostic  and  debugging  strategy.  The  use  of  entire 
genomes  as  the  blueprints,  some  of  which  are  small  enough  to  synthesize  de  novo,  is 
inconsistent  with  this  approach.  Self-replication  of  an  unadulterated  genome,  however 
impressive,  would  not  define  the  unnecessary  genes,  and  the  functions  of  about  a  third  of 
the  genes  would  remain  unknown  (Fraser  et  ah,  1995;  Jaffe  et  ah,  2004).  Building  a 
machine  from  mysterious  parts  can  only  create  a  mysterious  machine.  What  is  needed  is 
some  way  of  defining  a  near-minimal  genome  and  then  a  strategy  that  will  lead 
inexorably  to  an  understanding  of  all  of  its  parts. 
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Theoretical  and  experimental  studies  have  attempted  to  establish  a  minimal  set  of  genes 
needed  for  a  self-replicating  system  in  a  cushy  constant  environment  of  unlimited,  small 
molecule  nutrients  {e.g.  nucleotide  triphosphates,  amino  acids,  lipids  and  cofactors). 
Three  basic  approaches  present  themselves. 

Comparative  genomics  searches  for  genes  that  have  homologs  in  the  genomes  of  groups 
of  organisms.  The  approach  estimates  from  50  to  380  genes  in  a  minimal  genome  (Jaffe 
et  ah,  2004;  Koonin,  2000;  Mushegian  and  Koonin,  1996;  Tomita  et  al.,  1999).  It  has  the 
caveat  that,  among  closely  related  genomes,  some  genes  appear  “required”  for  those 
species  although  they  are  not  required  for  basic  life.  If  one  goes  to  longer  evolutionary 
distances,  many  gene  functions  are  replaced  by  non-homologous  genes,  hence  making 
some  essential  genes  look  dispensable  (e.g.  the  tRNA  modification  enzymes  used  by 
Mycoplasma  are  either  different  from  E.  coli  or  unidentified  by  sequence  identity,  but 
that  doesn't  mean  the  different  ones  are  dispensable).  An  additional  challenge  is  that 
about  a  third  of  the  essential  genes  have  unknown  functions.  It  is  thus  expected  that  a 
minimal  genome  based  on  this  approach  alone  would  be  unviable,  and  it  would  not  be 
possible  to  identify  the  missing  essential  genes. 

Genetics  searches  for  essential  genes  by  mutating  one  gene  at  a  time.  This  approach 
estimates  330  genes  in  a  minimal  genome  (out  of  Mycoplasma  genitalium's  total  of  517; 
(Hutchison  et  ah,  1999).  Again,  about  a  third  of  the  essential  genes  have  unknown 
functions.  It  is  limited  by  false  “essentials”  due  to  the  fraction  of  genes  that  were  never 
mutated  in  the  screen,  due  to  creation  of  toxic  partial  complexes  or  pathways,  and  due  to 
inadvertent  effects  on  adjacent  genes.  The  latter  effects  are  prevalent  in  bacteria  because 
a  primary  RNA  transcript  typically  encodes  multiple  gene  products.  At  the  other  extreme, 
false  "dispensables"  are  disastrous  when  trying  to  assemble  a  viable  minimal  genome  that 
lacks  all  of  the  individual  "dispensables".  For  example,  most  RNA  modification  enzymes 
are  individually  dispensable,  but  simultaneous  deletion  of  tens  of  them  would  be 
expected  to  be  unsustainable  due  to  cumulative  reductions  in  efficiency  or  fidelity  (a 
useful  working  definition  of  essentials  for  a  minimal  genome  should  encompass  such 
lethal  “dispensables”).  Again,  in  using  this  approach  alone,  it  would  not  be  possible  to 
identify  the  missing  essential  genes. 

Biochemistry  identifies  from  cell  fractions  those  gene  products  essential  for  the 
reconstitution  of  biochemical  reactions.  It  does  not  suffer  from  the  above  problems 
(except  creation  of  toxic  partial  complexes),  gives  access  to  details  of  kinetic  steps  and 
allows  debugging  of  isolated  subsystems.  However,  the  cellular  subsystems  must  be 
integrated  and  thoroughly  tested  for  accuracy  on  long  templates  before  they  can  be 
considered  physiological.  Nevertheless,  the  biochemical  approach  has  been  successful  at 
identifying  macromolecules  sufficient  for  reconstituting  DNA,  RNA  and  protein 
syntheses  and,  based  on  individual  subtraction  experiments,  the  components  have  either 
been  shown  to  be  necessary  or  could  be  so  tested.  Mindful  of  the  remaining  self¬ 
replication  functions  that  need  to  be  discovered  (see  below)  it  seems  likely  that  a  largely 
biochemical  approach,  now  further  empowered  by  mass  spectrometry  analyses  and 
genetic  and  comparative  genomic  information,  will  be  the  most  practical  route  to  define  a 
near-minimal,  well-understood  genome.  We  now  review  the  relevance  of  current 
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knowledge  and  teehnology  to  this  new  minimal  eell  projeet  (MCP),  (Luisi,  2002). 

A  minimal  genome 

A  MCP  may  be  realized  by  reconstituting  the  macromolecular  catalysts  that  synthesize 
DNA,  RNA  and  protein.  However,  this  overlooks  the  formation  of  the  membrane 
compartment  and  the  poorly-understood  process  in  which  it  is  divided  by  membrane 
proteins  (Gitai,  2005),  both  of  which  are  required  for  life.  But  lipids  alone  have  been 
shown  to  be  sufficient  for  formation  of  rudimentary  membranous  compartments  capable 
of  both  transmembrane  transport  of  small  molecules  and  fission  autocatalytically 
(Szostak  et  ah,  2001),  so  membrane  proteins  may  be  dispensable.  Polysaccharides  should 
also  be  dispensable.  If  the  simplest  and  best  characterized  examples  of  DNA,  RNA  and 
protein  synthesis  are  selected,  if  translation  of  all  codons  is  enabled  for  generalizability, 
and  if  efficiency  and  accuracy  are  not  compromised,  then  this  leads  to  the 
macromolecules  and  pathways  of  Figure  Al. 

A  detailed  list  of  the  gene  products  in  the  hypothetical  synthetic  minimal  cell  of  Figure 
Al  is  shown  in  Table  Al,  left  column.  This  list  overlaps  with  a  computational  model  of  a 
minimal  cell  gene  largely  derived  from  a  minimal  organism.  Mycoplasma  genitalium 
(Tomita  et  ah,  1999),  but  differs  by  omitting  enzymes  for  synthesizing  small  molecules 
(e.g.  lipids  and  glycolysis  substrates)  and  by  including  DNA  replication,  RNA 
processing,  RNA  modification,  extra  tRNAs  to  decode  the  whole  genetic  code,  some 
additional  essential  translation  components,  and  chaperones,  ft  should  be  emphasized  that 
Table  Al  is  a  working  model  only  and  that  strict  adherence  will  likely  hamper  progress. 
Examples  of  omitted,  potentially  stimulatory  genes  are  given  below.  Conversely, 
examples  of  included,  potentially  dispensable  genes  may  be  gleaned  by  comparison  with 
the  streamlined  Mycoplasma  genome  (Fraser  et  ah,  1995). 

Several  conclusions  can  be  drawn  from  the  provisional  list  of  genes  selected  for  a 
minimal  cell,  most  of  which  are  attractive  when  contemplating  a  MCP.  In  genomic  terms, 
the  list  is  very  short,  containing  only  151  genes  and  113  kbp.  All  of  the  genes  are  derived 
from  E.  coll  and  its  bacteriophages  (except  for  the  hammerhead  RNA  from  a  plant  virus 
(Forster  and  Symons,  1987)),  implying  that  the  individual  subsystems  will  be  compatible. 
In  contrast  to  lists  derived  by  comparative  genomics  or  genetic  approaches,  the 
biochemically-based  list  does  not  contain  any  genes  of  unknown  function  or  challenging 
membrane  proteins,  so  it  is  close  to  a  fully  understood,  accurately  replicating  “platform” 
for  life.  The  few  known  gaps  constitute  only  about  seven  genes,  all  of  which  are 
predicted  to  be  for  RNA  modification  (Table  Al,  yellow  in  left  column).  From  the 
viewpoint  of  structural  biology,  courtesy  of  recent  breakthroughs  in  ribosome  structure 
determination  (Diaconu  et  ah,  2005;  Ogle  and  Ramakrishnan,  2005),  significant  three- 
dimensional  information  is  lacking  for  only  3%  of  the  products:  a  few  RNA  modification 
proteins  and  aminoacyl-tRNA  synthetases  (Table  Al,  yellow  in  right  column).  While 
some  of  the  states  and  complexes  remain  to  be  solved  at  high  resolution,  a  draft  three- 
dimensional  structure  for  any  replicating  system  is  a  major  milestone  in  the  history  of 
biology. 
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3.1,2  Tools 

Genes  for  a  MCP  eould  be  synthesized  using  either  natural  or  unnatural  gene  sequenees 
as  starting  points.  Using  natural  gene  sequenees,  genes  ean  be  readily  synthesized  by 
PCR,  and  large  eloned  operons  of  essential  genes  ean  be  fused  using  synthetie  linkers  and 
homologous  reeombination.  However,  gene  synthesis  by  eloning  and  PCR  will  soon  be 
more  expensive  than  raw  synthesis  from  synthetie  oligodeoxyribonueleotides  (oligos). 
The  latter  also  allows  unnatural  sequenees,  sueh  as  versions  with  altered  eodon  bias  to 
adjust  mRNA  seeondary  struetures  (Tian  et  ah,  2004).  Sealability  and  eost  limitations  of 
established  methods  for  gene  synthesis  from  synthetie  oligos  are  now  being  overeome  by 
oligo  synthesis  on  ehips  followed  by  PCR  amplifieation  and  error-eorreetion  (Carr  et  ah, 
2004;  Richmond  et  ah,  2004;  Tian  et  ah,  2004;  Zhou  et  ah,  2004). 

3.2.  Results  and  Discussion 


3.2,1  Biochemical  Subsystems 

Several  biochemical  subsystems  are  required  to  synthesize  a  minimal  cell,  and  they  are 
reviewed  here.  For  each  subsystem,  possible  examples  from  natural  systems  will  be 
compared,  gaps  in  knowledge  will  be  identified,  and  diagnostic  and  debugging  strategies 
to  fill  the  gaps  will  be  suggested.  Mindful  of  the  goal  of  integration  of  the  subsystems, 
emphasis  is  placed  on  subsystems  that  are  homologous  and  that  operate  under  standard 
physiological  conditions. 

Genome  replication 

In  principle,  the  genetic  material  for  a  MCP  could  be  either  DNA  or  RNA.  Although  an 
RNA  genome  has  the  advantage  of  obviating  genes  for  DNA  replication,  the  challenges 
of  preventing  inhibitory  double-stranded  RNA  structures  and  replicative  mutations  in 
artificial  RNA  genomes  (Mills  et  ah,  1967)  are  unsolved.  So  the  genetic  material  for  a 
MCP  should  be  DNA. 

A  simple  possible  scheme  for  DNA  replication  that  could  be  completely  integrated  with 
biological  systems  is  shown  in  Figure  A2.  It  shows  rolling-circle  DNA  strand 
displacement  (Zhong  et  ah,  2001)  initiated  with  RNA  transcript  primers  synthesized  in 
situ  by  an  RNA  polymerase.  Processing  of  the  resulting  double-stranded  DNA 
concatemers  into  monomeric  DNA  circles  occurs  by  homologous  recombination  at  Lox 
sites  catalyzed  by  Cre  recombinase  (Sauer,  2002).  This  approach  has  advantages  over 
existing  rolling-circle  (Dahl  et  ah,  2004)  or  PCR  (Mitra  and  Church,  1999)  replication 
methods  since  it  requires  neither  solid  phase  oligo  synthesis  nor  changes  in  temperature, 
and  is  far  simpler  than  natural  DNA  replication  systems  (Khan,  1997). 

Rolling-circle  DNA  strand  displacement  could  be  engineered  in  a  stepwise  manner.  First, 
a  simpler  version  could  be  tested  in  which  the  T7  RNA  polymerase  and  RNA  processing 
are  substituted  by  addition  of  short  RNA  primers.  The  efficiency  of  synthesis  of 
monomeric  DNA  circles  would  be  followed  by  gel  electrophoresis  (Dahl  et  ah,  2004), 
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and  replication  fidelity  at  the  base  pair  and  whole  genome  levels  should  be  tested  with 
different  polymerases.  The  biggest  challenge  anticipated  is  boosting  the  efficiency  of 
monomeric  circular  template  generation  over  byproducts,  such  as  linear  DNAs  or 
oligomeric  circles.  Such  defective  byproducts  would  also  be  replicated  and  compete  for 
nutrients  (like  PCR  deletion  products  or  defective  interfering  viruses).  Defective 
byproducts  potentially  could  be  weeded  out  with  appropriate  selection  schemes.  For 
example,  encapsulation  of  individual  genomes  within  membranous  cells  would  result  in 
non-viability  of  cells  containing  deleted  genomes. 

Transcription 

A  single  RNA  polymerase  should  suffice  for  a  MCP.  Either  E.  coli's  multi-subunit 
enzyme  (Lewin,  2004)  or  the  single  polypeptide  enzyme  encoded  by  coliphage  T7 
(Studier  et  ah,  1990)  seem  best,  with  the  choice  influenced  by  several  considerations  that 
also  determine  possible  modes  of  regulation.  In  considering  the  whole  transcription  cycle 
for  a  minimal  replicating  system,  the  simpler,  more  predictable  T7  RNA  polymerase  is 
arguably  a  better  starting  point  than  the  E.  coli  RNA  polymerase  (a  detailed  comparison 
is  provided  in  supplementary  text). 

RNA  processing 

A  host  of  RNases  cleave  precursor  RNAs  in  vivo  (Li  and  Deutscher,  1996)  with  a 
complexity  that  could  be  reproduced  in  a  MCP.  However,  inclusion  of  these  RNases 
comes  with  the  risks  of  cryptic  cleavages,  and  a  simpler  approach  may  be  easier  to 
engineer  (Figure  A2,  top).  This  approach  generates  all  required  unadulterated  termini; 
tRNA  5’  and  3’  ends  (Forster  and  Altman,  1990)  and,  if  necessary,  the  3’  end  of  a  rRNA. 
The  self-cleaving  sequence  (Forster  and  Symons,  1987)  is  included  because  precursor 
tRNAs  with  substantial  3’  extensions  can  be  poor  substrates  for  RNase  P  (Li  and 
Deutscher,  1996)  and  RNA  polymerase  terminators  are  inefficient.  The  efficiency  of 
RNA  processing,  monitored  by  gel  electrophoresis,  could  be  improved  by  trying  several 
different  precursor-specific  sequences. 

A  minimal  translatome 

The  most  complex  universal  biological  machinery  is  clearly  translation.  Translation- 
associated  genes  (the  "translatome")  account  for  a  large  fraction  of  cellular  genes,  96%  of 
the  genes  in  Table  Al,  and  all  of  the  currently  predicted  gaps  in  knowledge  for  a  MCP. 
The  eukaryotic  version  is  less  attractive  for  engineering  than  the  bacterial  version 
because  it  contains  some  30  initiation  factor  proteins  and  because  eukaryotic  ribosome 
assembly  in  vitro  awaits  the  coordination  of  more  than  a  hundred  non-ribosomal 
macromolecules  (Fromont-Racine  et  ah,  2003).  Of  the  bacterial  systems.  Mycoplasma 
has  advantages  over  E.  coli  due  its  eight-fold-smaller  minimal  genome  and  its  simple  set 
of  29  tRNAs  that  is  the  only  completely  characterized  set  (Andachi  et  ah,  1989)). 
Unfortunately,  other  important  biochemical  information  for  Mycoplasma  is  essentially 
unknown  in  areas  where  it  is  well-studied  in  E.  coli  {e.g.  reconstitution  of  ribosomes  and 
translation, 

characterization  and  functional  assays  of  rRNA  modifications,  characterization  of  RNA 
modification  enzymes).  Presently,  this  seems  to  favor  the  E.  coli  translatome  for  a  MCP. 
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Purified  translation 

Efficient  synthesis  of  proteins  has  been  reconstituted  from  purified  natural  components 
(Kung  et  ah,  1978)  or  recombinant  His-tagged  translation  factors  (Shimizu  et  ah,  2005) 
from  E.  coli,  but  not  yet  from  eukaryotes.  The  next  steps  with  the  E.  coli  system  will  be 
verifying  accuraey  by  mass  spectrometry  and  extending  the  short  lifetime  of  the  batch 
mode  by  eontinuous  dialysis  (Spirin  et  ah,  1988).  The  versatility  of  the  system  will 
beeome  apparent  as  more  mRNAs  are  translated.  If  stronger  mRNA  secondary  structures 
prove  inhibitory  despite  the  helicase  aetivity  of  the  ribosome  (Takyar  et  ah,  2005), 
introduetion  of  an  RNA  helicase  may  be  helpful.  Given  that  aminoaeyl-tRNA 
synthetases,  translation  factors,  and  ribosomal  proteins  are  among  the  most  abundant 
proteins  in  the  cell,  it  will  be  important  to  verify  that  the  purified  system  can  produce 
high  concentrations  of  all  of  these  proteins. 

An  in  vitro  ribosome 

The  ribosome  of  ehoiee  is  from  E.  coli  because,  in  eontrast  with  its  eukaryotic  cousins,  it 
has  been  self-assembled  from  its  purified  components  (Nierhaus  and  Dohme,  1974; 
Nomura  and  Erdmann,  1970;  Traub  and  Nomura,  1968)  and  is  homologous  with  the  other 
eomponents  of  the  gene  list  (Table  Al).  Reeonstituted  ribosomes  have  only  been  assayed 
by  synthesis  of  phenylalanine  polymers  from  polyU  templates  (Eietzke  and  Nierhaus, 
1988),  so  future  assays  need  to  test  initiation  and  elongation  at  non-UUEl  codons,  and 
also  termination.  Eurthermore,  the  self-assembly  protocol  is  finicky  and  non- 
physiological.  In  vitro  assembly  of  the  308  subunit  under  physiological  temperatures  has 
been  attained  reeently  by  adding  the  DnaK/DnaJ/GrpE  ehaperone  system  (Maki  and 
Culver,  2005),  although  this  system  is  dispensable  in  vivo  (El  Hage  et  ah,  2001).  Perhaps 
addition  of  natural  polyamines  might  overcome  the  requirement  for  an  unphysiologically 
high  concentration  of  magnesium  ions.  All  54  of  the  ribosomal  proteins  have  been  eloned 
((Culver  and  Noller,  1999;  Semrad  et  ah,  2004);  the  hypothesis  that  they  (and  other 
proteins  in  Table  Al)  can  be  synthesized  in  a  purified  translation  system  in  active  forms 
warrants  testing. 

rRNA  production  in  a  purified  system  is  complieated  by  post-transeriptional  nucleoside 
modifications.  Since  5S  rRNA  lacks  nucleoside  modifieations  and  is  short,  it  is  not 
surprising  that  it  is  active  when  transcribed  in  vitro  (Zvereva  et  al.,  1998).  But  the  other 
two  rRNAs  are  modified  by  about  20  enzymes  in  E.  coli,  half  of  which  are  unidentified. 
All  11  modifieations  of  the  E.  coli  small  subunit  16S  rRNA  are  dispensable  for  subunit 
assembly  and  aminoacyl-tRNA  binding  (Krzyzosiak  et  al.,  1987).  However,  E.  coli  23S 
rRNA  lacking  its  23  modifications  is  30-fold  less  aetive  than  the  natural  version  in  N-Ac- 
Met-puromyein  synthesis  (Semrad  and  Green,  2002)  due  to  one  to  six  modifications  in  a 
relatively  small  RNA  domain  (Green  and  Noller,  1996).  The  enzymes  that  catalyze  these 
six  modifications  are  therefore  ineluded  in  Table  Al,  although  the  two  known  ones  are 
individually  dispensable  (Del  Campo  et  al.,  2001).  Other  bacteria  should  also  be 
entertained  for  a  MCP,  as  these  six  E.  coli  modifieations  are  not  conserved  and  the 
unmodified  23  S  RNAs  from  two  other  eubacteria  are  quite  active  (Green  and  Noller, 
1999;  Khaitovich  et  al.,  1999). 
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In  vitro  tRNAs 

Which  of  the  myriad  tRNA  genes  and  tRNA  modifieation  enzymes  are  likely  to  be 
sufficient  to  decode  all  61  sense  eodons  in  a  MCP?  There  are  some  85  tRNA  genes  in  E. 
coli  eoding  for  some  45  different  tRNAs  eaeh  bearing  post-transeriptional  modifieations 
on  about  10%  of  their  nucleosides,  and  a  fifth  of  the  tRNAs  still  remain  to  be 
eharaeterized  at  the  modifieation  level.  At  least  27  different  types  of  nueleoside 
modifieations  are  present  in  E.  coli  (Bjork,  1995).  There  are  an  estimated  40-50  tRNA 
modifieation  enzymes  in  E.  coli,  about  half  of  whieh  remain  to  be  identified.  To  make 
matters  worse  (or  more  interesting)  for  a  MCP,  the  roles  of  the  tRNA  modifieations  are 
eontroversial. 

Arguments  for  choosing  essential  tRNA  modification  activities  are  highly  speeulative.  As 
few  as  33  E.  coli  tRNAs  may  be  suffieient  to  translate  the  entire  genetie  eode  aecurately 
(Table  Al,  left;  Figure  A3).  E.  coli  tRNAs  eould  be  substituted  with  the  eompletely 
eharaeterized  set  from  Mycoplasma  capricolum,  whieh  eontains  only  14  types  of 
nueleoside  modifieations  (Andaehi  et  ah,  1989),  some  of  whieh  differ  from  E.  coli. 
However,  the  predieted  savings  in  number  of  essential  tRNAs  and  modifieation  enzymes 
are  minor  (Table  Al,  middle  eolumn),  and  full  eompatibility  with  the  heterologous  E. 
coli  translation  apparatus  seems  unlikely  {e.g.  the  eodon  UGA  in  Mycoplasma  eneodes 
Trp,  not  stop). 

Eaeh  in  vitro-synthesized  naseent  tRNA  transeript  should  be  modified  with  different 
eombinations  of  modifieation  enzymes  and  tested  for  effieieney  and  aeouraey  of  eodon 
reeognition  in  translation,  initially  in  a  simplified  purified  translation  system  (Forster  et 
al.,  2001).  Identifieation  of  the  unknown  modifieation  enzymes  is  being  hastened  by 
bioinformatic  and  genomic  approaches  (Soma  et  al.,  2003).  It  is  also  eoneeivable,  though 
unlikely,  that  unknown  small  moleeules  would  need  to  be  identified  bioehemieally  for 
RNA  modifieation  (or  other  reactions).  The  remaining  E.  coli  tRNA  modification 
enzymes  not  listed  in  Table  Al  might  be  predieted  to  be  dispensable  based  on  available 
data  (Bjork,  1995;  Giege  et  al.,  1998).  But  given  the  uneertainties,  it  may  be  faster  to  get 
to  a  working  near-minimal  eell  by  using  every  known  E.  coli  modifieation  enzyme. 

Post-translation 

A  MCP  must  promote  eorrect  protein  folding  and  any  neeessary  post-translational  amino 
aeid  modifieations.  Early  versions  of  a  purified  replieating  system  will  contain  cell- 
derived  maeromoleeules,  so  establishing  that  sueh  systems  ean  be  eompletely  weaned 
from  eells  will  require  enough  rounds  of  replieation  for  “infinite”  dilution  of  the  starting 
maeromoleeules.  This  will  test  for  dependenee  on  folding  by  ehaperones  and  on  post- 
translational  modifieations.  It  is  unelear  whieh,  if  any,  ehaperones  will  be  neeessary,  but 
GroEE/ES  (El  Hage  et  al.,  2001;  Kerner  et  al.,  2005)  are  likely  eandidates  (Table  Al). 
The  only  known  examples  of  required  post-translational  modifications  for  the  proteins  in 
Table  Al  are  the  reeently  diseovered  methylations  of  translation  release  factors  1  and  2 
catalyzed  by  release  factor  Gin  methylase  (Table  Al)  (Heurgue-Hamard  et  al.,  2002; 
Nakahigashi  et  al.,  2002).  Other  possibilities  include  ribosomal  protein  acetylations. 
Mass  speetral  eomparisons  between  proteins  made  in  the  purified  system  and  those  made 
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in  vivo  will  expose  modifieations  and  also  assess  fidelity,  while  the  inactivity  of  a  protein 
of  expected  mass  would  suggest  a  protein-folding  deficit  and  the  need  for  an  additional 
chaperone.  Any  necessary  missing  components  could  be  identified  biochemically  by 
mixing  with  fractionated  crude  extracts  or  through  genetics. 

Compartments  and  division 

Membranes  would  allow  evolution  without  serial  transfers  and  purifications,  extension  of 
the  system  to  new  environments,  and  better  modeling  of  cells.  On  the  other  hand, 
membranous  boundaries  are  unnecessary  for  directed  evolution  (Mills  et  ah,  1967)  or,  in 
theory,  self-replication.  Membranes  also  restrict  applications  (e.g.  delivery  of  unnatural 
amino  acyl-tRNAs,  selection  schemes  based  on  binding  and  spatial  arraying  for 
nanofabrication).  Addition  to  self-replicating  macro  molecules  of  lipids  alone  may  be 
sufficient  for  encapsulation  of  the  macromolecules  within  bilayer  membrane  vesicles, 
synthetic  cell  division  and  transmembranous  small  molecule  transport  (Szostak  et  ah, 
2001).  The  choice  of  lipids  is  wide  open,  but  one  should  not  underestimate  the  challenges 
involved  in  working  with  them  (Luisi,  2002)  nor  the  advantages  in  regulation  to  be 
gained  by  adding  membrane-modeling  proteins  (e.g.  pores,  transporters  and  the  yet-to- 
be-discovered  complement  of  cell  division  proteins  (Gitai,  2005)). 

Integrating  the  subsystems 

How  might  all  of  the  biochemical  subsystems  in  Figure  A1  be  combined  to  generate  a 
self-sustaining  system?  This  is  clearly  a  new  level  of  complexity  in  comparison  with 
prior  self-assembly  projects.  None  of  the  subsystems  described  above  are  completed,  yet 
their  selection  is  based  on  a  reasonable  plan  for  their  ultimate  integration.  The  approach 
again  would  be  stepwise,  and  there  are  many  possible  pathways  that  could  be  integrated 
in  parallel  (Figure  Al).  For  example,  transcription  by  T7  RNA  polymerase  couples  well 
with  a  purified  E.  coli  translation  system  (Shimizu  et  ah,  2005).  Theoretical  integration  of 
DNA  synthesis,  RNA  synthesis  and  RNA  processing  was  discussed  above  (Figure  A2). 
These  four  different  subsystems  could  then  be  combined  to  synthesize  part  of  a  fifth 
system  (the  ribosome)  by  synthesis  of  an  antibiotic-resistant  16S  rRNA  and  His-tagged 
versions  of  all  21  small  subunit  ribosomal  proteins  (Tian  et  ah,  2004).  The  products  of 
these  integrated  subsystems  could  then  be  assayed  for  correct  in  vitro  reconstitution  of 
small  ribosomal  subunits  by  (i)  selecting  for  resistance  of  protein  synthesis  to  the 
antibiotic,  and  (ii)  detecting  the  presence  of  tagged  proteins  in  purified  small  ribosomal 
subunits  by  Western  blot  with  anti-His  antibodies.  As  another  example,  rudimentary 
vesicles  encapsulating  replicating  systems  (e.g.  Q  replicase)  were  shown  to  be  capable  of 
multiplication  (Luisi,  2002). 

Numerous  fine-tuning  strategies  can  be  envisioned.  Relative  strengths  of  DNA  promoters 
and  mRNA  ribosome-binding  sites  for  different  genes  could  be  modeled  on  the  in  vivo 
strengths,  with  necessary  adjustments  of  synthetic  rates  (and  thus  concentrations  of 
products)  achieved  by  mutations  in  the  binding  sites.  Additional  modules  might  be 
useful,  such  as  catabolism  (nucleases  and  proteases),  active  conversion  or  removal  of 
waste  products  (e.g.  by  energy  regenerating  enzymes  or  membrane  transporters)  and 
regulatory  feedback  (e.g.  excess  transcription  ->  excess  T7  lysozyme  mRNA  ->  excess 
lysozyme  ->  lysozyme  binding  to  and  inhibition  of  T7  RNA  polymerase).  Control  of 
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macromolecular  concentrations  will  be  aided  by  in  silico  modeling  and  design  (Tomita  et 
al.,  1999).  Given  that  the  subsystems  discussed  above  were  seleeted  with  integration  in 
mind  by  ehoosing  physiologieal  reaetion  eonditions  and  homologous  eomponents,  and 
given  that  additional  subsystems  eould  always  be  borrowed  from  living  eells  as  needed 
{e.g.  E.  coli  RNA  polymerase  and  regulatory  modules  sueh  as  riboswitehes  (Isaaes  et  ah, 
2004)),  it  seems  likely  that  this  approaeh  will  eventually  produee  synthetie  self¬ 
rep  lieation  and  ultimately  a  self-sustaining  minimal  eell. 

It  is  important  to  note  that  a  minimal  eell  would  be  intentionally  fragile.  For  example,  the 
vesiele  would  be  easily  lysed  and  the  small  moleeule  feeding  mix  would  be  highly 
speeialized  indeed  (ineluding  unstable  eofaetors  sueh  as  N-5,10-methenyltetrahydro folate 
and  S-adenosylmethionine).  These  built-in  safety  features  will  prevent  a  minimal  eell 
from  replieating  outside  the  laboratory.  However,  some  or  all  of  the  synthetie  genes  for  a 
MCP  would  be  intentionally  passaged  through  living  eells  for  eonstruetion  of 
reeombinant  DNA  elones  and  for  amplifioation.  Constantly  upgraded  ethieal  and  safety 
regulations  in  plaee  for  existing  biohazards  would  also  eneompass  this  researeh  (Cho  et 
ah,  1999);  http://arep.med.harvard.edu/SBP/Chureh_Biohazard04e.htm . 


3.2.2  Completion 

In  eonelusion,  a  stepwise  bioehemieal  approaeh  lends  itself  to  the  eventual  identifieation 
of  any  remaining  funetions  essential  for  the  synthesis  of  a  minimal  eell  sustained  solely 
by  small  moleeules.  Five  states  of  eompletion  present  themselves  as  traetable  goals  of  a 
MCP.  Namely,  the  identifieation  of: 

(1)  the  genes  listed  as  missing  in  Table  Al, 

(2)  any  additional  genes  and  organization  neeessary  experimentally  for  minimal  eell 
synthesis, 

(3)  any  dispensable  genes, 

(4)  bioehemieal  parameters  and  eomputational  models  suffieiently  detailed  to  prediet  the 
effeets  of  alterations,  and 

(5)  the  missing  three-dimensional  struetures  of  the  gene  produets  and  their  relevant 
eomplexes. 

It  is  diffieult  to  prediet  how  long  it  will  take  to  debug  eaeh  of  the  individual  bioehemieal 
subsystems  or  to  put  them  all  together,  so  it  is  important  to  bear  in  mind  that  there  are 
short-term  goals.  Intermediate  assembly  steps  eould  also  be  pursued  while  the  gaps  in 
RNA  modifieation  knowledge  (Table  Al)  are  being  filled.  For  example,  the  projeet  to 
assemble  a  ribosome  under  physiologieal  eonditions  eould  be  earried  out  without  the 
missing  23 S  rRNA  modifieation  enzymes  (Table  Al)  by  substituting  in  natural  23 S 
rRNA.  Similarly,  assembly  of  self-replieation  in  the  absenee  of  funetional  in  vitro- 
synthesized  tRNA  substrates  eould  be  earried  out  using  eellular  total  tRNA  to  enable 
self-replieation  from  substrates  (rather  than  just  small  moleeules)  as  a  major  step  towards 
understanding  biologieal  self-replieation.  This  would  also  allow  direeted  evolution  of  all 
of  the  eomponents  exeept  the  tRNAs  in  a  more  flexible  manner  than  is  possible  in  vivo 
{e.g.  for  seleeting  ribosome  mutants  that  ineorporate  unnatural  amino  aeids  more 
effieiently). 
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The  biochemical  subsystems  necessary  for  a  MCP  are  central,  old  fields  that  have  lost 
impetus.  Completion  within  a  decade  will  only  be  possible  through  a  coordinated  filling 
of  the  key  gaps  in  knowledge  by  the  cutting-edge  laboratories  scattered  around  the  world 
in  these  fields.  It  will  also  require  stimulation  of  rate-limiting  fields.  For  example,  though 
rRNAs  and  tRNAs  can  constitute  more  than  70%  of  the  dry  weight  of  a  cell,  half  of  the 
estimated  60-70  RNA  modification  enzymes  of  E.  coli  and  one  fifth  of  the  tRNAs  remain 
to  be  characterized,  despite  the  recent  completion  of  about  300  bacterial  whole  genome 
sequences.  The  momentum  of  genomics  and  consequent  deluge  of  computed  hypotheses 
cries  out  for  comparable  breakthroughs  in  experimental  tests.  Synthetic  systems  biology 
projects  such  as  a  MCP  promise  such  tests  with  the  added  bonus  of  new  applications. 
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+  except  1,21 

+ 

Large  subunit  5S  ribosomal  RNA 

+ 

+ 

Large  subunit  23S  ribosomal  RNA 

+ 

+ 

Large  subunit  23S  rRNA  G2445>m2G  methylase:  unidentified 

unknown 

- 

Large  subunit  23S  rRNA  U2449>dihydroU  synthetase:  unidentified 

unknown 

- 

Large  subunit  23S  rRNA  U2457>pseudoU  synthetase 

unknown 

- 

Large  subunit  23S  rRNA  C2498>Cm  methylase:  unidentified 

unknown 

- 

Large  subunit  23S  rRNA  A2503>m2A  methylase:  unidentified 

unknown 

- 

Large  subunit  23S  rRNA  U2504>pseudoLI  synthetase 

unknown 

- 

All  33  large  subunit  ribosomal  proteins  (1-7,9-11,13-25,27-36) 

+  except  25,  30 

+ 

Translational  initiation  factor  1 

+ 

+ 

Translational  initiation  factor  2 

+ 

+ 

Translational  initiation  factor  3 

+ 

+ 

Translational  elongation  factor  Tu 

+ 

+ 

Translational  elongation  factor  Ts 

+ 

+ 

Translational  elongation  factor  G 

+ 

+ 

Translational  release  factor  1 

+ 

+ 

Translational  release  factor  2 

- 

+ 

Translational  release  factor  Gin  methylase 

+ 

+ 

Translational  release  factor  3 

- 

+ 

Ribosome  recycling  factor 

+ 

+ 

33/45  Transfer  RNAs  (see  Fig.  2) 

Set  of  29 

+ 

tRNA  C34>lysidine  synthetase 

unidentified 

+ 

tRNA  A34>l  deaminase 

unidentified 

+ 

tRNA  LI34>cmo5LI  (=V)  synthetases:  unidentified 

- 

- 

tRNA  LI34>2sLI  Cys  desulfurase 

- 

+ 

tRNA  LI34>2sLI  synthetase 

unidentified 

+ 

tRNA  LI34>cmnm5LI  GTPase 

unidentified 

+ 

tRNA  LI34>cmnm5LI  synthetase 

unidentified 

+ 

tRNA  cmnm5LI34>nm5U>mnm5LI  synthetase 

unidentified 

- 

tRNA  G37  N1 -methylase 

+ 

+ 

tRNA  A37>t6A  N6-threonylcarbamoyl-A  synthetase:  unidentified 

unidentified 

- 

tRNA  A37>i6A  synthetase 

- 

+ 

tRNA  i6A37>s2i6A>ms2i6A  synthetase 

- 

+ 

All  22  aminoacyl-tRNA  synthetase  subunits  (20  enzymes) 

+  except  Gly  sub.,  Gin 

+  except  Gly  sub.,  Ala 

Met-tRNA  formyltransferase 

+ 

+ 

Chaperonin  GroEL 

+ 

+ 

Chaperonin  GroES 

+ 

+ 

151  genes  =  38  RNAs  +  113  proteins 

Table  Al.  Biochemically-derived  list  of  genes  that  may  encode  a  nsefnl,  near-minimal,  self- 
replicating  system  dependent  only  on  small  molecnle  nntrients. 

Gaps  in  knowledge  are  in  yellow.  Left  column;  chosen  gene  products  and  DNA  sites. 
Middle  column:  relationship  to  the  minimal  genome  of  Mycoplasma  genitalium;  clear 
sequence  homolog  =  known  enzyme  product  without  an  evident  sequence  homolog 
=  “unidentified”;  no  functional  homolog  =  Right  column:  high-resolution,  three- 
dimensional,  structural  information;  >25%  of  the  structure  solved  =  “+”,  <25%  = 
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Figure  Al.  A  minimal  cell  containing  biological  macromolecules  and  pathways  proposed  to  be 
necessary  and  sufficient  for  replication  from  small  molecule  nutrients. 

The  macromolecules  are  all  nucleic  acid  and  protein  polymers  and  are  encapsulated 
within  a  bilayer  lipid  vesicle.  The  small  molecules  (brown)  diffuse  across  the  bilayer. 
The  macromolecules  are  ordered  according  to  the  pathways  in  which  they  are  synthesized 
and  act.  They  are  colored  by  biochemical  subsystem  as  follows:  blue  =  DNA  synthesis, 
red  =  RNA  synthesis  and  cleavage,  green  =  RNA  modification,  purple  =  ribosome 
assembly,  orange  =  post-translational  modification,  and  black  =  protein  synthesis.  MFT  = 
methiony  1-tRN A^®*i  formy Itransferase . 
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Figure  A2.  A  generalizable,  physiologically-compatible  tbeoretical  schema  for  accurate  DNA 

replication  and  RNA  synthesis  in  vitro. 

Polymerase  movements  are  illustrated  by  colored  arrow  heads.  DNA  synthesis,  A  nicked 
double-stranded  DNA  circle  (middle)  undergoes  rolling-circle  DNA  synthesis  by 
coliphage  phi29  DNA  polymerase  (Dahl  et  ah,  2004)  to  give  an  oligomeric  single- 
stranded  DNA  (bottom,  blue).  RNA  primers  (red)  then  hybridize  at  two  sites  to  prime 
lagging  strand  DNA  synthesis  (bottom,  green).  When  two  Lox  sites  (bottom,  L)  are 
completed,  recombination  occurs  between  them  catalyzed  by  coliphage  PI  Cre 
recombinase  (black  cross)  to  form  a  duplicate  of  the  original  circular  template.  RNA 
synthesis.  The  circular  genetic  operon  (middle)  contains  a  promoter  for  T7  RNA 
polymerase  (P),  a  ribosomal  RNA  (rRNA)  gene,  two  transfer  RNA  (tRNA)  sequences,  a 
self-cleaving  hammerhead  sequence  (H),  and  a  T7  terminator  (T).  RNA  synthesis  from  P 
generates  a  precursor  RNA  (top,  red)  containing  three  cleavage  sites  (thin  black  arrows). 
The  second  tRNA  sequence  merely  serves  as  a  recognition  site  for  RNase  P  cleavage. 
Cleavages  yield  the  mature  rRNA  and  tRNAi.  Any  cleavage  product  containing  a  3' 
hydroxyl  group  or  primary  RNA  transcript  can  serve  as  a  primer  for  DNA  synthesis 
(bottom,  red). 
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Figure  A3.  All  nucleoside  modifications  of  all  33  synthetic  tRNAs  that  may  he  sufficient  for  accurate 

translation. 

Outside  (shaded):  mRNA  codons  of  the  genetic  code  are  illustrated  in  the  standard 
format,  except  that  the  3’  U  and  C  are  switched  to  simplify  depiction  of  decoding.  Inside: 
tRNA  nucleotides  34-37  (from  5’  to  3’)  and  their  cognate  amino  acids.  Nucleotides  34-36 
are  the  anticodons,  and  nucleotides  37  are  represented  by  black  superscripts.  Codon  and 
anticodon  positions  that  base  pair  with  each  other  are  colored  similarly.  Stop  codon 
specificities  of  release  factor  (RF)  proteins  are  included.  The  portions  of  the  tRNA 
sequences  not  shown  in  the  figure  are  unmodified.  Expected  modifications  of  in  vitro 
transcripts  by  the  enzymes  in  Table  Al,  and  expected  amino  acid  and  codon  specificities 
are  given.  *  =  unspecified  modification,  _  =  unknown  modification  status,  ms  i  A  =  2- 
methylthio-N^-isopentenyladenosine,  m*G  =  1-methylguanosine,  t6A  =  N^- 
threonylcarbamoyladenosine,  cmnm^U  =  5-carboxymethylaminomethyluridine,  V  = 
cmo  U  =  uridine  5-oxyacetic  acid,  I  =  inosine,  cmnm  s  U  =  5- 
carboxymethylaminomethyl-2-thiouridine,  k2C  =  lysidine,  S  =  mnm  s  U  =  5- 
methylaminomethyl-2-thiouridine,  mnm5U  =  5-methylaminomethyluridine. 
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Transcription  by  RNA  polymerase  from  E.  coli  versus  coliphage  T7 
Initiation 

All  E.  coli  genes  are  transeribed  by  E.  coli  RNA  polymerase,  given  the  appropriate  sigma 
subunit  (Lewin,  2004).  For  a  MCP  with  the  target  genes  in  Table  Al,  use  of  natural 
promoters  with  the  rpoD  sigma  subunit  would  probably  maintain  the  natural  relative 
initiation  rates,  although  this  needs  to  be  eonfirmed.  Any  promoters  that  didn't  funetion 
(perhaps  due  to  the  laek  of  an  uncharacterized  factor)  could  then  be  "fixed"  by 
substitution  with  working  promoters.  Natural  regulatory  proteins  could  also  be  added  to 
provide  control  mechanisms.  However,  cryptic  initiation  sites  are  anticipated  for  E.  coli 
RNA  polymerase  but  not  for  the  more  selective  T7  RNA  polymerase.  The  latter 
polymerase  could  be  regulated  by  the  binding  of  T7  lysozyme  and  by  promoter  binding 
by  lac  or  other  repressors  (Studier  et  ah,  1990).  Relative  initiation  rates  for  T7  RNA 
polymerase  could  be  set  using  T7  promoters  of  different  strengths,  although  such 
promoters  dictate  the  first  few  bases  of  the  transcripts  (usually  pppGGG...).  Nevertheless, 
this  restriction  in  5'-terminal  sequence  does  not  appear  to  be  problematic  for  the  synthesis 
of  any  of  the  RNAs  needed  for  a  MCP  (see  below). 

Elongation  and  termination 

T7  RNA  polymerase  transcribes  about  five  times  faster  than  the  E.  coli  one,  altering  the 
relative  E.  coli  rates  of  coupled  transcription  and  translation,  although  this  works  well  for 
phage  T7  infections  and  when  over-expressing  genes  in  vivo.  Termination  by  both 
enzymes  is  inefficient,  so  tandem  terminators  should  be  tried.  T7  has  the  important 
advantage  of  high  processivity  through  essentially  any  sequence  until  it  reaches  its 
natural  terminator  (class  I  or  II  (Lyakhov  et  ah,  1998)).  The  E.  coli  polymerase 
terminates  prematurely  within  genes  containing  an  anti-termination  signal  {e.g.  found  in 
rRNA  genes)  if  the  anti-terminator  factor(s)  fail  to  act.  Since  the  number  of  different  anti- 
termination  mechanisms  in  E.  coli  is  unknown,  the  extent  of  anti-termination  is  also 
unknown.  Though  premature  termination  could  be  detected  easily  in  MCP  experiments 
and  might  be  overcome  in  some  cases  by  omission  of  transcription  termination  factors, 
other  cases  would  require  altering  codon  bias  with  the  hope  that  the  natural  premature 
termination  signal  (sometimes  an  unknown  sequence)  is  destroyed  without  otherwise 
affecting  transcription  or  translation. 

The  cycle 

The  transcription  cycle  of  E.  coli  is  more  complex  than  T7's  because  it  requires 
association  and  dissociation  of  the  sigma  factor. 

tRNA  modifications 

The  roles  of  the  tRNA  modifications  are  controversial.  The  genetic  approach  almost 
always  finds  a  particular  tRNA  modification  enzyme  to  be  dispensable  (Bjork,  1995),  and 
even  the  enzyme  that  synthesizes  the  universal  U  to  T  modification  at  position  54  is  only 
essential  due  to  a  function  separable  from  tRNA  modification  (Persson  et  ah,  1992).  The 
biochemical  approach  finds  unmodified  tRNAs  to  be  active  in  translation  systems 
(Cornish  et  ah,  1995;  Harrington  et  ah,  1993),  although  careful  comparisons  between 
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individual  unmodified  tRNA  transcripts  and  their  modified  counterparts  (either  purified 
natural  isoacceptors  or  chemically-synthesized  tRNAs  (Wang,  1984))  are  limited.  Thus, 
the  hypothesis  that  the  modifications  are  unimportant  is  widely  held.  But  the  source  of 
tRNAs  for  all  in  vitro  protein  syntheses  are  cellular  total  tRNAs  (because  so  many 
different  tRNAs  are  required),  so  attempting  protein  synthesis  with  in  vitro-synthesized 
tRNAs  will  be  helpful  in  testing  this  hypothesis.  The  contrary  view,  that  several  tRNA 
modifications  will  be  of  key  importance  for  a  MCP,  seems  most  likely.  The  comparative 
approach  argues  for  their  essentiality  (Bjork,  1995).  Genetics  rarely  rules  out 
pseudorevertants  (suppression  by  secondary  mutations  (Gutgsell  et  ah,  2001))  and  has 
recently  identified  a  few  essentials  (Bjork  et  ah,  2001;  Soma  et  ah,  2003;  Wolf  et  ah, 
2002)  (not  to  mention  the  potential  for  pairwise  lethals).  Biochemical  assay  interpretation 
should  be  tempered  by  the  presence  in  crude  translation  extracts  of  endogenous 
modification  activities  (Samuelsson  et  ah,  1988)  and  the  paucity  of  assays  performed  in 
pure  charging  and  translation  systems  (Harrington  et  ah,  1993). 

What  modification  enzymes  can  be  predicted  to  be  essential  for  self-replication? 
Nucleotide  34  in  the  anticodon  wobble  position,  and  nucleotide  37  directly  3’  of  the 
anticodon,  contain  the  most  complex  (hyper-)  modifications,  and  all  of  the  most  likely 
essentials.  Charging  by  aminoacyl-tRNA  synthetases  in  E.  coli  requires  mnm5s2U34  in 
tRNA-Lys  and  tRNA-Glu  (and  perhaps  the  related  modification  in  tRNA-Gln),  t6A37  in 
tRNA-Ilel,  and  lysidine34  in  tRNA-Ile2  (Bjork,  1995;  Giege  et  ah,  1998).  The  latter  is 
the  only  known  example  of  a  modification  acting  as  an  anti-determinant  by  preventing 
mischarging  (with  Met  (Giege  et  ah,  1998)),  but  a  systematic  search  for  others  is  needed 
by  in  vitro  charging  of  unmodified  tRNAs  in  a  purified  system  containing  all  20 
aminoacyl-tRNA  synthetases.  Accurate  wobbling  during  codon  recognition  requires 
lysidine34,  mnm5s2U34  and  its  variants  (mnm5U34  and  cmnm5U34),  cmo5U34  and 
inosine34  (Curran,  1998;  Yokoyama  and  Nishimura,  1995).  The  active  anticodon  loop 
confirmation  of  tRNA-Lys  is  stabilized  by  direct  interaction  of  mnm5s2U34  and 
t6A37(Sundaram  et  ah,  2000).  mlG37  is  essential  to  prevent  frameshifting  (Bjork  et  ah, 
2001).  t6A37  and  msi6A37  (and  its  variant  i6A37)  stabilize  A-U  and  U-A  base  pairs  at 
the  N36  position  of  codon-anticodon  duplexes  by  stacking,  presumably  important  for 
increasing  translational  efficiency  at  these  codons  (Grosjean  et  ah,  1998). 


4  Quantitative  Proteomic  Software 


4,1,  Methods,  Assumptions,  and  Procedures 

Whole-cell  protein  quantitation  using  mass  spectrometry  has  proven  to  be  much  more 
challenging  than  mRNA  quantitation.  The  detection  efficiency  varies  significantly  from 
peptide  to  peptide;  the  molecular  identities  are  not  evident  a  priori  and  are  dispersed 
unevenly  throughout  the  multidimensional  data  space.  In  this  study  we  have  developed 
open-source  software,  called  MapQuant,  which  quantitates  all  organic  species  in  large 
mass  spectrometry  datasets 
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We  tested  MapQuant  on  Bovine  Serum  Albumin,  BSA  samples  in  21  Liquid 
Chromatography  /  Mass  Speetrometry,  LC/MS  experiments,  in  triplieate  at  seven 
different  eoneentrations  (7  -  5000  fimoles)  for  quantitation  purposes  (labeled  q- 
experiments)  and  two  LC/MS  experiments  for  identifieation  purposes  (labeled  s- 
experiments).  For  each  q-experiment,  MapQuant  generated  a  two-dimensional  map  of 
scans  against  m/z  bins.  Analysis  entailed  applying  algorithms  for  noise  filtering, 
watershed  segmentation,  peak  finding  and  fitting,  peak  clustering  and  isotopic-cluster 
deconvolution  and  fitting  using  binomially  distributed  clusters  of  gaussioid  peaks. 
MS/MS  spectra  were  interpreted  using  the  program  SEQUEST. 

Out  of  the  190  tryptic  peptides  used  for  the  quantitation,  172  were  identified  by 
SEQUEST.  Although  the  data  were  acquired  on  a  low  resolution  spectrometer, 
MapQuant  enabled  us  to  search  and  find  18  more  peptides  based  on  m/z  position  and 
charge  estimation.  These  190  tryptic  peptides  cover  94.85%  of  the  BSA  sequence. 

The  data  has  shown  evidence  of  linearity,  at  least  for  the  highly  abundant  peptides 
observed  in  the  range  of  7  to  1600  fimoles.  We  have  also  developed  a  model  for 
ionization  efficiencies  by  calculating  ionization  coefficients  for  each  amino  acid.  This 
model  gives  us  the  capability  to  describe  the  quantitation  level  of  BSA  as  a  whole 
protein.  The  applicability  to  quantitation  of  more  complex  mixtures  such  as  a  proteome 
appears  to  scale  linearly  with  number  of  peptides,  as  long  as  the  peak  overlap  density  is 
kept  at  a  low  level. 

With  the  capability  of  performing  whole-cell  proteome  analysis  (Eipton  et  al.  2002;  Jaffe 
et  al.  2004),  a  need  to  extend  this  process  to  quantitation  has  become  increasingly 
apparent.  Methods  for  measuring  the  state  of  an  organism’s  proteome  have  been 
successful  with  the  use  of  2-D  polyacrylamide  gel  protein  maps,  followed  by  spot 
excision,  digestion  of  the  protein  with  trypsin  and  peptide  sequencing  using  reversed- 
phase  liquid  chromatography  coupled  to  electrospray  ionization  mass  spectrometry  (ESI- 
MS)  (Gygi  et  al.  1999;  Pandey  and  Mann  2000).  Quantitation  of  peptide  mixtures  using 
only  chromatographic  separation  methods  coupled  to  mass  spectrometry  has  proven  to  be 
a  more  easily  automated  procedure  than  2-D  electrophoresis.  However,  quantitation  of 
proteins  in  complex  mixtures  using  the  signal  acquired  from  their  constitutive  tryptic 
peptides  has  become  a  very  desirable  and  challenging  goal.  The  reason  being  that  the 
detection  efficiency  varies  significantly  from  peptide  to  peptide;  the  molecular  identities 
are  not  evident  a  priori  and  are  dispersed  unevenly  throughout  the  multidimensional 
separations.  Although,  commercially  available  programs  are  available  for  quantitation, 
they  are  not  designed  for  high-throughput  proteomic  data  and  the  algorithms  used  in  them 
are  not  publicly  available. 
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Knowing  the  quantities  of  proteins  in  a  biologieal  system  is  erueial  in  understanding 
post-transeriptional  events  (Gygi  et  al.  1999;  Futeher  et  al.  1999)  ineluding  translational 
effieieney,  post  translational  modifieations,  eompartmentalization,  interaetions,  and 
turnover. 

In  this  study  we  proposed  a  data  analysis  method  utilizing  one  or  more  ehromatographic 
(or  eleetrophoretie)  separation  dimensions  and  a  mass  separation  dimension  provided  by 
the  mass  speetrometer.  Data  from  an  LC/MS  experiment  (hereafter  referred  to  as 
experiment)  ean  be  analyzed  after  being  formatted  into  a  data-strueture  ealled  a  2-D  map. 
A  2-D  map  is  essentially  a  matrix  whose  rows  and  columns  represent  scans  and  m/z  bins 
respectively.  The  2-D  map  of  a  tryptic  peptide  from  BSA  is  shown  in  Figure  Bl.  The 
separation  dimensions  are  considered  orthogonal  since  they  describe  two  independent 
properties  of  the  peptides:  mass  and  hydrophobicity.  For  this  reason  a  parallelism  can  be 
drawn  between  a  2-D  map  and  a  2-D  electrophoresis  gel,  where  in  the  latter  the  two 
dimensions  represent  mass  and  isoelectric  point.  The  advantage  of  this  visualization 
method  is  that  the  experimentalist  gets  a  global  view  of  the  type  of  species  eluting  from 
the  column  and  their  relative  position  to  each  other. 

Using  the  above  as  motivation  we  have  developed  open-source  software,  called 
MapQuant,  which  given  large  amounts  of  mass-spectrometry  data,  outputs  quantitation 
for  any  organic  species  in  the  sample.  We  will  discuss  features  of  the  software  including 
several  algorithms  used  in  it.  Furthermore,  we  have  applied  MapQuant  in  the  study  of 
BSA  samples  at  different  concentrations  and  tried  to  develop  a  peptide  ionization  model 
that  would  explain  the  abundances  observed  for  the  BSA  tryptic  peptides.  Analyses  of 
tryptic  peptides  of  BSA  have  been  carried  out  in  the  past  (Bruce  et  al.  1999;  Hirayama  et 
al.  1990),  however  without  any  attempts  for  absolute  quantitation  using  its  constituent 
tryptic  peptides.  Another  goal  of  this  study  was  to  set  the  ground  on  the  standardization 
of  storage  and  quantitation  of  mass  spectrometry  data  and  create  a  community  where 
investigators  can  share  their  algorithms  and  data  structures. 


4.1,1,  Data  Acquisition 

The  sample  used  in  this  study  was  a  BSA  digest  standard  from  Michrom  BioResources 
(910/00002/15).  Samples  were  injected,  with  a  Famos™  auto  sampler,  on  a  reversed- 
phase  column  coupled  to  a  Finnigan  LCQ™  DEC  A  XP+  mass  spectrometer.  The  column 
was  an  in  house  made  15  cm  x  75  pm  capillary  filled  with  Magic  Cig  resin. 

The  BSA  digest  was  diluted  in  95%  water,  5%  acetonitrile  and  0.1%  formic  acid  to  a 
final  volume  of  10  pL  for  each  of  twenty-three  experiments.  Twenty-one  of  them 
involved  seven  different  amounts  of  BSA  peptides  in  triplicate.  These  BSA  amounts 
include  7,  21.5,  66.67,  200,  500,  1600  and  5000  fimoles.  The  signal  acquisition  method 
for  these  experiments  was  carried  out  in  the  profile  mode  and  did  not  involve  any  MS/MS 
scans,  since  they  were  meant  for  quantitation  purposes  {q -experiments).  Moreover,  these 
q-experiments  were  run  from  low  to  high  concentrations  to  minimize  carryover  effects. 
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The  remaining  experiments  involved  two  5000  fmole  samples  that  were  run  with  a  signal 
aequisition  method  that  included  MS/MS  scans  for  sequencing  purposes  (s-experiments). 


4,1.2.  Data  Storage 

In  order  for  MapQuant  to  be  functional,  a  standard  for  storing  raw  LC/MS  experiment 
data  had  to  be  developed.  For  the  moment  we  are  calling  this  standard  OpenRaw  but  in 
the  meantime  we  are  in  the  process  of  developing  an  XML  version  of  it.  All  OpenRaw 
files  were  created  by  a  program  written  in  Visual  C++  using  the  API  provided  by  the 
company  Finnigan.  OpenRaw  file  format  has  the  advantage  of  being  readable  on  all 
computer  platforms  due  to  its  open-source  nature. 


The  raw  data  of  an  experiment  were  stored  in  three  functionally  distinct  folders  which 
themselves  were  within  a  parent  folder  named  after  the  LC/MS  experiment  (Figure  B2). 
These  folders  are:  (a)  a  global  parameters  folder,  (b)  an  MSI  spectra  archive  folder  and 
(c)  an  MS/MS  spectra  archive  folder. 


The  global  parameters  folder  holds  four  files.  The  size.param  file  stores  information 
about  the  size  of  the  data  of  an  LC/MS  experiment,  i.e.  total  number  of  scans  and  total 
number  of  mass  bins.  The  RTSA.param  file,  which  stands  for  Retention  Time  Sampling 
Array,  stores  information  about  all  the  time  points  at  which  each  mass  spectrum  was 
scanned.  The  MSSA.param  file,  which  stands  for  Mass  Sampling  Array,  stores 
information  about  the  spacing  of  the  sampling  points  in  the  m/z  dimension  given  by  the 
mass  spectrometer.  The  file  InstrumentMethod.param  stores  information  about  the 
instrument  method  used  by  the  user  at  that  particular  LC/MS  experiment. 


The  MSI  spectra  folder  contains  the  file  expmntjiame.msar .  The  extension  msar  stands 
for  mass  spectrum  archive  and,  as  the  name  indicates,  it  stores  the  ion-abundance  signal 
from  each  mass  spectrum  in  a  concatenating  manner. 


Similarly,  the  MS/MS  spectra  folder  stores  the  file  expmnt_name.ms2ar  which  is  a 
concatenation  of  the  ion-abundance  signal  for  all  MS/MS  spectra.  MS/MS  spectra  can  be 
analyzed  for  peptide  identification  by  a  sequencing  program. 


4,1,3,  Data  Analysis 

Strategy:  Identification  -  Quantitation 

Since  the  mass  accuracy  of  the  spectrometer  used  in  this  study  was  not  very  high,  we 
were  reluctant  to  base  peptide  identification  solely  on  the  m/z  of  the  peaks  observed. 
Instead,  we  used  the  well  established  strategy  of  acquiring  MS/MS  spectra  upon  peptide 
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fragmentation,  followed  by  sequeneing  using  commercially  available  software,  like 
SEQUEST  (Eng  et  al.  1994).  However,  if  we  had  chosen  to  perform  MS/MS  scans  in  our 
data  acquisition  scheme,  we  would  have  limited  the  number  of  MS  scans  that  would  have 
been  acquired,  hence  reducing  the  sampling  data  points  for  quantitation.  Eor  example,  a 
140  min  chromatographic  run  without  acquiring  any  MS/MS  spectra  would  yield  5881 
scans,  whereas  the  same  run  set  to  collect  five  MS/MS  scans  per  MS  scan  would  yield 
4433  scans,  of  which  about  one  fifth  would  be  the  MS  scan  used  to  reconstruct  the  2-D 
map.  To  circumvent  this  problem  we  collected  MS  data  with  and  without  MS/MS 
spectra.  These  runs  were  referred  to  as  s-experiments  and  q-experiments  respectively,  as 
described  in  the  data  acquisition  section.  The  strategy  of  how  we  linked  the  quantitative 
output  from  the  q-experiment  to  the  identification  output  of  an  s-experiment  is  outlined  in 
Eigure  B3. 

In  order  to  link  a  sequenced  MS/MS  event  to  a  fitted  peak,  the  following  algorithm  was 
implemented,  which  was  essentially  encoded  in  the  program  assignq2.  Eirst,  every  entry 
in  the  SEQEIEST  summary  file  (Eigure  B3)  reflects  a  sequencing  event  (MS/MS  scan) 
which  has  a  unique  retention-time  and  m/z  coordinates  in  a  2-D  map,  as  indicated  by  the 
blue  points  in  Eigure  B4.  The  sequence  for  these  MS/MS  scans  that  is  depicted  next  to 
the  blue  points  represent  the  highest  scoring  peptide  given  by  SEQUEST.  Isotopic 
clusters  that  were  identified  using  MapQuant  also  have  centroids,  represented  by  red 
points  in  Eigure  B4.  Eor  each  sequencing  event  a  rectangular  area  that  is  1  m/z  and  80 
scans  wide  in  each  dimension  was  searched  for  possible  MapQuant  peaks  that  lay  within 
it.  If  there  were  multiple  sequencing  events  that  were  assigned  the  same  peptide,  the 
peaks  captured  by  these  events  were  pooled  and  then  ranked  according  to  their  Euclidean 
distance  from  their  corresponding  sequencing  event.  Also  the  charge  predicted  by 
MapQuant  was  checked  for  agreement  with  the  charge  used  by  the  sequencing  program. 
If  the  charge  was  different  the  peak  was  rejected.  The  procedure  described  above  was 
used  for  all  21  q-experiments,  thus  creating  a  calibration  table  for  further  analyses. 

Definitions  and  Data  Structures 

Experiment  is  the  data  structure  that  holds  information  about  an  EC/MS  experiment. 
More  specifically  it  holds  information  on  the  sampling  of  the  eluent  at  different  time 
points.  Also  it  holds  information  about  the  sampling  in  the  m/z  dimension  used  by  the 
spectrometer. 

Scan  is  the  sampling  unit  in  the  chromatography  dimension 

Mass  b'm  is  the  sampling  unit  of  the  mass  spectrometer  when  measuring  the  m/z  of  the 
produced  ions. 

Map2D  is  the  data  structure  describing  a  2-D  map  as  defined  in  the  introduction.  It  is 
stored  in  the  form  of  matrix  (Eigure  B5). 

Mass  spectrum  is  defined  as  the  signal  acquired  at  particular  time  point.  It  can  be  thought 
of  as  a  row  (or  column)  of  the  Map2D  matrix  (Eigure  B5). 

Mass  chromatogram  is  defined  as  the  signal  present  at  a  fixed  mass  bin  point.  It  can  be 
also  thought  of  as  a  row  (or  column)  of  the  Map2D  matrix  (Eigure  B5). 
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Segment  map  is  defined  as  a  region  of  a  parent  2-D  map  to  which  the  operation  of 
segmentation  was  performed.  Segmentation  is  performed  to  partition  the  signal  (peaks)  in 
the  map  for  easier  analysis  (Figure  B6). 

Peak  is  defined  as  a  point  in  the  2-D  map  where  it  is  considered  to  be  a  local  maximum 
(as  defined  by  a  PeakF'mder  model) 

FittedPeak  (or  FPeak)  is  a  peak  that  has  been  fitted  to  a  particular  model  described  by  a 
mathematical  equation,  such  as  a  two-dimensional  Gaussian  (Figure  B6). 

Peak  group  is  the  cluster  of  fitted  peaks  that  can  represent  candidate  co-eluting  isotopic 
clusters  (fig.  6). 

Peak  group  map  is  defined  as  the  minimum  2-D  map  needed  for  fitting  the  estimated 
number  of  isotopic  clusters  that  a  peak  group  might  contain  (Figure  B6). 

Isotopic  cluster  is  a  group  of  peaks  that  represent  the  isotopic  variants  of  a  molecular 
species  (Figure  1). 

MapQuant 

MapQuant  is  a  program  designed  for  quantitation  of  data  from  an  LC/MS  experiment  that 
is  in  OpenRaw  format,  as  described  above.  The  quantitation  procedure  involved 
formatting  the  data  into  a  2-D  map  and  applying  several  algorithms  on  the  2-D  map  with 
the  goal  of  outputting  a  list  of  abundances  of  possible  isotopic  clusters.  Below,  we  are 
presenting  the  order  that  several  algorithms  were  applied  to  the  data  acquired  from  the 
BSA  samples  mentioned  in  the  data  acquisition  section. 

Since  the  data  were  quite  noisy,  especially  in  the  chromatography  dimension,  noise  filters 
were  applied.  More  specifically,  smoothing  algorithms  such  as  applying  a  moving- 
averaging  filter  were  applied  (Press  1992).  The  purpose  of  the  smoothing  algorithm  is  to 
facilitate  the  detection  of  all  local  maxima  (peaks)  found  in  the  2-D  map. 

The  second  step  in  the  data  processing  involved  segmentation  of  the  2-D  map  into 
smaller  areas  called  segments,  such  that  overlapping  peaks  were  confined  into  unique 
segment  maps  (Figure  B7).  The  purpose  of  the  segmentation  algorithm  was  to  minimize 
the  number  of  data  points  being  used  simultaneously  for  fitting.  It  also  made  sure  that 
overlapping  peaks  were  included  in  the  same  fitting  iteration. 

The  third  step  of  the  quantitation  process  involved  the  application  of  a  peak  detection 
algorithm  for  finding  the  positions  of  the  local  maxima  in  every  segment  map,  and  then 
using  that  information  as  initial  conditions  for  the  curve  fitting  algorithm. 

The  fourth  step  in  the  procedure  involved  single  linkage  clustering  of  the  fitted  peaks  into 
clusters  referred  to  as  peak  groups.  Peak  groups  represent  co-eluting  peaks  that  might 
include  one  or  more  possible  isotopic  clusters. 

Finally,  more  refined  fitting  was  performed  for  each  peak  group.  This  refining  algorithm 
involved  an  iteration  of  subtraction  and  residual  fitting  based  on  previously  estimated 
peak  widths.  This  algorithm  was  chosen  because  peptides  that  had  charges  of  +2  and  +3, 
had  isotopic  peaks  that  were  significantly  overlapping.  However,  peptides  with  a  +4 
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charge  were  impossible  to  resolve  by  this  method,  since  the  resolution  offered  by  the 
Finnigan  LCQ™  DECA  XP+  spectrometer  was  too  low. 

The  above  five  steps  were  ineluded  in  the  strueture  of  scripts  called  MQScripts  that  can 
be  read  by  MapQuanf  s  parser  called  MQParser.  MapQuant  algorithms  were  written  in 
ANSI  C  and  the  MQParser  was  written  using  the  general-purpose  parser  generator 
program  called  bison.  Moreover,  MapQuant  includes  minimal  visualization  features 
including  2-D  map  visualization  as  well  as  mass  spectrum  and  mass  chromatogram 
visualization. 

In  the  next  seetion,  some  of  the  most  important  funetions  reeognized  by  the  MQParser 
are  discussed  in  detail,  along  with  the  algorithms  they  encapsulate. 

Algorithms 

1 .  Smoothing  by  convolution 

2.  Watershed  segmentation 

3.  Peak  Finding  and  Peak  Fitting 

4.  Peak  Clustering 

5.  Refine  m/z  peaks  (Deeonvolving  by  fitting  and  subtracting) 

6.  Deconvolve  and  isotopic  carbon  peaks 

Smoothing  hy  convolution 

Convolution  is  the  equivalent  of  band  filtering  in  the  frequeney  domain  after  a  Fourier 
transform  has  been  applied  to  either  a  mass  ehromatogram  or  a  mass  spectrum.  The 
implementation  of  this  filter  utilizes  the  mathematical  property  shown  in  equation  1 
(Press  1992),  where  s  is  the  signal  array  and  h  is  the  impulse  array.  S  and  H  are  the 
Fourier  transforms  of  s  and  h  respeetively. 

s®h  =  F  \S.H)  (1) 

Related  Functions  in  MQScript 

Map2D  Map_ApplyFilter  (Map2D  map,  chav*  filter,  double  dim) 

The  string  filter  deseribes  the  impulse  funetion  that  will  be  applied  to  eaeh  mass  speetrum 
or  mass  chromatogram  of  the  2-D  map  "map".  The  value  of filter  ean  either  be  “BC_n”  or 
“GS  n  rn”  where  n  and  m  are  integers.  BC  refers  to  a  box-ear  filter  (otherwise  called 
moving-average  filter)  of  n  points  and  GS  refers  to  a  Gaussian  filter  of  standard  deviation 
n.m.  The  value  of  dim  ean  be  either  RT_DIMENSION  or  MZ  DIMENSION,  referring  to 
the  dimension  of  the  2-D  map  that  the  filter  will  be  applied. 

Watershed  segmentation 

The  goal  of  this  quantitation  analysis  is  to  fit  every  peak  in  the  2-D  map.  However,  since 
fitting  all  peaks  at  the  same  time  is  computationally  too  expensive,  a  strategy  involving 
compartmentalization  is  eonsidered.  This  is  achieved  by  segmenting  the  map  using  the 
watershed  segmentation  algorithm  (Vincent  and  Soille  1991). 
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Related  Functions  in  MQScript 

Map2D  Map  Segment(Map2D  map,  double  bm_size) 

This  function  takes  the  2-D  map  map  and  segments  it  using  the  watershed  algorithm.  The 
parameter  bm_size  is  used  to  denote  the  stepping  size  used  by  the  flooding  procedure  of 
the  watershed  algorithm. 

VariantArray  Map_GetSegmentArray(Map2D  map) 

This  function  returns  an  array  of  type  ‘segment’  which  defines  the  boundaries  of  a 
segment  map. 

Map2D  Segment_GetMap(Segment  segment,  Map2D  source,  Map2D  mask) 

This  function  extracts  a  segment  map  from  the  map  source,  using  a  map  which  acts  as  a 
mask.  Each  point  in  the  2-D  map  mask  holds  the  segment  number  in  which  it  belongs  to. 
This  procedure  is  illustrated  in  Figure  B7. 


Peak  Finding  and  Peak  Fitting 

After  compartmentalizing  the  parent  2-D  map  into  segment  maps,  the  goal  of  peak 
finding  and  peak  fitting  becomes  computationally  more  manageable.  The  peak  detection 
algorithm  used  here  uses  concepts  from  mathematical  morphology  such  as  a  structuring 
element  (Ritter  and  Wilson  2001).  A  structuring  element  can  be  considered  a  small 
binary  image  that  an  image  operator  can  take  as  input  along  with  the  image  of  interest. 
An  example  would  be  the  image  operation  of  erosion  (symbolized  by  between  the 
image-signal  S  and  the  structuring  element  N  that  would  act  as  a  kernel  (analogous  to  the 
response  element  used  in  convolution).  The  operation  mentioned  above  is  shown 
mathematically  in  equation  2. 

T  =  S  ^  N  (2) 

The  peak  detection  algorithm  uses  a  structuring  element  such  as  the  ones  in  Figure  B8,  in 
order  to  decide  which  neighboring  points  for  each  data  point  in  the  2-D  map  are  to  be 
included  in  the  image  operation.  The  shaded  point  indicates  the  point  of  reference  of  the 
structuring  element. 


In  the  operation  of  peak-detection,  a  data  point  in  the  2-D  map  is  considered  a  local 
maximum  only  if  its  value  is  larger  than  all  the  neighboring  points  defined  in  the 
structuring  element  as  1 .  Mathematically  speaking 


a,  = 


{ 


PY.f^{s„N,)}  =  \N 


0  otherwise 


(3) 


Where 


Hp,g)=  { 


r  lifp>=q 

1  0  otherwise 

and 

N 

Af|  =  XiV, 


N  represents  the  structuring  element  and  Sk  the  value  of  the  data  point  k  in  the  2-D  map  S. 
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To  avoid  detecting  pseudo-peaks  due  to  the  noisy  signal  an  abundance  threshold  is  also 
set  for  all  the  points  in  the  structuring  element.  The  abundance  threshold  is  usually  set  to 
m  +  2s,  where  m  is  the  median  of  the  map  and  s  is  the  average  absolute  deviation  from 
the  median. 


After  a  list  of  candidate  peaks  is  formed,  they  are  fitted  as  a  sum  of  curves  described  by  a 
mathematical  equation.  For  example,  if  in  a  segment  map  there  are  n  candidate  peaks, 
and  if  each  peak  is  chosen  to  be  fitted  as  curve  C,  then  the  whole  segment-map  would  be 


n 

Ic,. 


fitted  as  i  .  In  this  study  we  chose  to  fit  each  curve  with  a  double  Gaussian, 
i 

referred  to  from  now  on  as  the  gaussioid  curve,  i.e. 

A 


ir-Vof 


2(T, 


2cr^ 


(4) 


As  seen  from  the  equation,  the  number  of  parameters  to  be  fitted  per  peak  is  five, 
i.e.  abundance  A,  retention-time  centroid  ro,  mass-over-charge  centroid  mo,  the  standard 
deviation  of  the  Gaussian  in  the  retention  time  dimension  Qr  and  finally  the  standard 
deviation  of  the  Gaussian  in  the  mass-over-charge  Gm.  The  method  used  for  peak-fitting 
is  the  “non-linear  least  squares”  method  (Press  1992).  It  is  a  minimization  method  using 
steepest  descent.  It  requires  knowledge  of  the  first  derivative  for  each  of  the  parameters 
to  be  fitted.  For  example  if  there  are  n  candidate  peaks  in  a  segment  map  and  we  want  to 
fit  them  using  the  gaussioid  curve,  we  would  need  to  fit  5n  parameters  and  the  algorithm 
would  require  5n  partial  derivatives. 


Related  Functions  in  MQScript 

Segment  Segment_FindAndFitPeaks(Segment  segment,  Map2D  map,  char*  structel, 
double  abuthr,  double  noisefactor,  double  numberofsd,  char*  curve) 

This  function  finds  and  fits  peaks  in  the  2-D  map  map,  and  returns  them  inside  a  variable 
of  type  ‘Segment’.  The  arguments  used  by  the  peak- finding  part  of  the  algorithm  are 
structel  and  abuthr.  The  argument  structel  refers  to  the  structuring  element  used  in  the 
peak-finding  of  the  algorithm.  It  can  take  values  such  as  “N9x3E”,  “N3x3R”,  etc  (Figure 
B8).  The  argument  abuthr  denotes  the  threshold  value  which  needs  to  be  met  by  all  the 
points  in  the  structuring  element.  The  argument  curve  refers  to  the  peak-fitting  part  of 
the  algorithm  and  it  denotes  the  type  of  curve  that  the  peak  should  be  fitted  too.  It  can 
take  values  such  as  “NR  GAUSSIOD”  or  “NR  EM  GAUSSIOD”.  The  arguments 
noisefactor  and  numberofsd  are  not  used  anymore, 
double  Map_GetMedian(Map2D  map)  and 
double  Map_GetAvgAbsDevFromMedian(Map2D  map) 

The  function  Map  GetMedian  calculates  a  distribution  of  the  intensities  of  all  the  points 
in  the  2-D  map  map  and  returns  its  median.  Similarly  Map  GetAvgAbsDevEromMedian 
returns  the  average  absolute  deviation  from  the  median. 


Peak  Clustering 
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In  order  to  decide  which  fitted  peaks  belong  to  which  isotopic  clusters  we  cluster  the 
peaks  into  data  structures  called  peak  groups.  Peaks  belonging  in  an  isotopic  cluster  have 
restrictions  in  their  relative  position  in  the  isotopic  cluster.  More  specifically  isotopic 
peaks  have  to  be  co-eluting  (Figure  Bl).  Secondly,  the  peaks  cannot  be  separated  more 
than  1  m/z  unit,  which  is  maximum  distance  defined  by  peptides  with  charge  +1.  The 
above  natural  restrictions,  co-elution  and  the  1  m/z  maximum  distance  limit,  are  taken 
into  account  by  limiting  the  number  of  inter-peak  distances  when  peaks  are  clustered. 

Related  Functions  in  MQScript 

double  FPeakFile_ClusterAndSave(char*  szlnFilename,  char*  szOutFilename,  double 
scanjhr,  double  mzjhr,  double  abundjhr,  double  nLinkageType) 

This  function  reads  the  fpeaks  from  the  file  szlnFilename,  clusters  them  into  peak  groups 
and  then  outputs  them  into  the  file  szOutFilename.  The  arguments  scanjhr  is  used  to  set 
the  scan  tolerance  of  the  co-elution  restriction  mentioned  above.  Likewise  the  argument 
mzjhr  is  used  to  set  the  m/z  tolerance  for  the  maximum  allowed  distance  between  two 
isotopic  peaks.  This  value  was  set  to  1.1  m/z,  and  calculated  from  known  +1  peptides. 

Refine  m/z  peaks  (Deconvolving  by  fitting  and  subtracting) 

Due  to  the  fact  that  the  spectrometer  cannot  resolve  isotopic  peaks  of  peptides  that  have 
charges  of  +3  and  above,  we  apply  an  algorithm  following  a  ‘fit  and  subtract’  strategy. 
We  use  a  simple  test  for  discriminating  between  ‘oversized’  or  ‘undersized’  peaks  from 
‘normal’  peaks.  The  metric  applied  in  this  case  makes  the  assumption  that  peak  width  in 
the  m/z  dimension  is  constant  throughout  the  experiment.  The  pseudo  code  for  this 
algorithm  is  provided  in  the  supplementary  material. 

Related  Functions  in  MQScript 

PeakGroup  PeakGroup_Refine6(Peakgroup  peakgroup,  Map2D  PGMap,  VariantArray 
pPGMapFPeaks,  char*  structel,  double  numberofsd,  char*  curve) 


Deconvolve  and  fit  isotopic  carbon  peaks 

In  this  algorithm  we  assume  that  each  peak  group  may  represent  one  or  more  isotopic 
clusters.  So  we  devise  an  algorithm  for  sequential  prediction  of  each  isotopic  cluster  by 
going  through  the  fpeaks  based  on  increasing  m/z.  The  peaks  that  are  candidates  for  an 
isotopic  cluster  are  to  be  fitted  using  a  binomially  distributed  sum  of  two-dimensional 
Gaussians. 


{r-rj 
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Where 


B(i;c,p)  = 


fA 


(6) 


Related  Functions  in  MQScript 
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PeakGroup  PeakGroup  Deconv3(PeakGroup  peakgroup,  Map2D  PGMap,  VariantArray 
pPGMapFPeaks,  char*  curve); 

The  above  funetion  can  be  divided  into  two  parts  whieh  are  iterated  until  all  fpeaks  are 
distributed  into  isotopic  clusters; 

1.  Guess  the  most  likely  subset  of  the  fpeaks  in  a  peak  group,  whieh  can  comprise  a 
possible  isotopic  cluster. 

Substitute  those  peaks  with  a  binomially  distributed  gaussioid  curve  (denoted  with  the 
string  “NR_BD_GAUSSIOD”)  and  fit  the  peak  group  map  with  equation  7,  where  m  is 
the  number  of  single  gaussioid  curves  C  (eq.  4)  and  n  is  the  number  of  binomially 
distributed  gaussioid  curves  B  (eq.  5). 

m  n 

(7) 

i  i 


First  we  provide  a  deseription  of  the  first  part  of  the  algorithm  whieh  itself  is  eomprised 
of  two  steps.  As  a  first  step  it  involves  the  finding  of  all  possible  sets  of  fpeaks  whose 
m/z  forms  an  arithmetic  series,  as  shown  in  equation  8,  and  hence  might  provide 
evidence  of  belonging  to  the  isotopic  variant  peaks  of  a  charged  moleeular  species.  This 
algorithm  is  referred  to  a  charge  deconvolution  algorithm. 

a^=a^+d 
a2=  ao+  2d 
^3  =<d^+3d 


where  d  =  1/Z  ,  Z  is  the  charge  of  the  molecular  species  (8) 

Figure  B9  shows  a  possible  arrangement  of  fpeaks  in  a  peak  group  and  the  schematic 
description  of  the  charge  deconvolution  algorithm.  This  algorithm  extracts  sets  of  fpeaks 
that  could  belong  to  candidate  isotopie  elusters.  As  seen  in  Figure  B9,  for  each  fpeak 
belonging  to  the  peak  group,  whieh  itself  ean  be  represented  as  a  directed  acyelie  graph,  a 
seareh  is  being  performed  through  it  and  stored  in  a  tree.  For  example,  steps  2-4  describe 
the  creation  of  a  tree  whose  root  is  peak  A,  step  5  describes  the  ereation  of  the  tree  whose 
root  is  peak  B,  etc.  Each  tree  stores  all  possible  paths  that  link  the  root  with  every  leaf. 
The  paths  in  all  the  trees  represent  all  candidate  isotopic  clusters  for  the  peak  group  under 
study.  In  order  to  choose  the  most  plausible  isotopic  cluster  candidate  we  rank  these 
isotopic  clusters  according  to  the  sum  of  the  abundances  of  their  constituent  peaks.  In  this 
way,  low  abundant  moleeular  speeies  do  not  interfere  with  the  deeonvolution  process 
since  they  are  deeonvolved  later.  The  next  step  after  eharge  is  to  take  the  most  abundant 
eandidate  isotopie  cluster  diseussed  above,  and  to  try  to  estimate  a  possible  earbon 
eontent  by  eomparing  the  abundances  of  its  eomprising  peaks  to  the  binomial  distribution 
described  in  equation  6.  The  metrie  used  in  this  seleetion  is  the  average  square  deviation 
form  an  idealized,  hence  calculated  binomial  distribution  for  a  particular  carbon  content. 
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Similar  algorithms  for  carbon  deconvolution  have  been  reported  in  the  literature 
(Wehofsky  et  al.  2001). 

Finally,  after  the  first  part  of  the  algorithm  is  ealled  iteratively  and  the  number  of  isotopie 
elusters  represented  in  the  peak  group  is  found,  they  are  fitted  with  equation  7. 

4.2.  Results  and  Discussion 
Coverage 

The  BSA  sequenee  used  in  the  study  was  identified  by  running  SEQUEST  on  nine  BSA 
sequenees  extracted  from  the  National  Center  for  Bioteehnology  Information  (NCBI) 
database.  Erom  the  peptides  that  seored  (xeorr  >  2.0)  it  was  evident  that  the  24  amino 
aeid  leading  peptide  was  not  present  in  the  mature  form  of  the  BSA  used  in  the 
experiment,  implying  a  protein  of  583  amino  aeids  in  length.  The  sequenee  is  shown  in 
Eigure  BIO  and  referred  to  from  now  on  as  mBSA-A214T. 

SEQUEST  was  re-run  using  a  protein  database  that  included  mBSA-A214T  as  well  as  58 
trypsin  sequenees.  Enzyme  settings  were  set  to  none,  in  order  to  inerease  the  eonfidenee 
for  the  identity  of  the  tryptie  ones.  The  program  was  set  to  search  for  +1,  +2  and  +3 
charged  variants.  Moreover,  it  was  set  to  take  into  aeeount  possible  amino  aeid 
modifieations  sueh  as  lysine  and  arginine  carbamylation,  methionine,  histidine  and 
tryptophan  oxidation,  and  glutamine  N-terminal  loss  of  ammonia.  The  following  eharge 
speeifie  xeorr  eutoff  values  were  set:  3  for  +3  peptides,  2  for  +2  peptides  and  1  for  +1 
peptides.  The  peptide  eharge  variants  that  passed  the  eharge-speeifie  eross-eorrelation 
seore  thresholds  amounted  to  242.  Erom  these  peptide  eharge  variants  only  half  (121) 
were  fully  tryptic  on  both  termini.  By  observing  sequenees  of  the  non-tryptie  peptides  we 
eoneluded  that  an  enzyme  with  ehymotrypsin  aetivity  was  present  in  the  digestion 
mixture,  sinee  74  of  the  amino  aeids  found  at  the  C-terminus  of  the  restriction  site  were 
either  phenylalanine,  tyrosine  or  leucine  (Antal  et  al  2001).  However  we  hypothesize  that 
ehymotrypsin  activity  was  attributed  to  an  enzyme  that  was  co-purified  with  trypsin  and 
in  a  lesser  amount,  sinee  only  8  peptide  eharge  variants  were  fully  ehymo tryptie. 

Out  of  the  242  BSA  peptide  eharge  variants  70  were  regarded  as  false  positives  and  were 
rejeeted  beeause  no  signal  was  found  on  the  two-dimensional  maps.  The  remaining  172 
peptide  eharge  variants  eover  540  amino-aeid  residues  out  of  a  total  of  583  whieh 
eorresponds  to  92.62%. 

MapQuant  Performance  In  order  to  evaluate  MapQuant’s  performanee  we  try  to 
estimate  the  pereentage  of  SEQUEST  hits  that  have  been  linked  to  a  possible  MapQuant 
isotopie  eluster  of  peaks.  It  should  be  noted  that  the  program  performs  better  with  +2 
peptides.  An  obvious  reason  for  the  bias  mentioned  above  is  due  to  the  low  resolution  of 
the  speetrometer.  Eor  example  the  average  m/z  bin  size  was  about  1/15  (0.067)  m/z  units 
wide.  This  means  that  peaks  belonging  to  an  isotopie  eluster  of  a  +3  peptide  would  be 
only  5  bins  apart  given  an  average  peak  width  of  0.16  m/z  (2.4  bins).  This  makes  it 
extremely  diffieult  for  any  algorithm  to  resolve  the  peaks. 
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Moreover,  it  should  be  noted  that  the  laek  of  finding  an  isotopic  cluster  with  the  correct 
charge  does  not  imply  that  the  algorithm  did  not  find  any  peaks  that  were  in  the  vicinity 
of  the  MS/MS  event.  Another  reason  for  the  lack  of  finding  all  peaks  can  be  attributed  to 
the  way  the  two  experiments  (q-experiment  and  s-experiment)  were  chosen  to  be  aligned 
and  to  the  size  of  the  window  chosen  to  be  searched  around  the  MS/MS  events.  Thirdly, 
since  only  certain  peptides  give  high  enough  signal  to  be  detected  in  the  high 
concentrations  of  the  sample,  we  assume  that  a  significant  number  of  the  total  number  of 
peptides  would  be  detectable  in  the  low  concentration  data  points.  For  this  reason  we 
observed  that  the  percentage  cover  increased  for  the  four  most  concentrated  samples  (12 
points). 

Additional  Non-SEQUEST  Peptides  In  addition  to  the  172  peptide  charge  variants  we 
included  18  peptide  charge  variants  that  were  not  found  by  SEQUEST.  There  are  two 
possible  reasons  for  peptide  isotopic  clusters  present  on  a  2-D  map  not  to  be  identified  by 
SEQUEST.  One  reason  could  be  the  fact  that  an  MS/MS  spectrum  corresponding  to  a 
peptide  isotopic  cluster  is  not  interpretable  by  the  program.  Another  reason  could  be  the 
complete  absence  of  MS/MS  spectra  for  a  peptide  isotopic  cluster  due  to  the  difficulty  of 
sampling  MS/MS  spectra  for  a  2-D  map  densely  populated  by  peaks. 

These  18  non-SEQUEST  peptides  were  divided  into  four  groups  based  on  the  criteria  for 
accepting  the  identity  to  be  true.  The  first  group  included  seven  peptides  that  were  found 
by  MapQuant  after  analyzing  a  2-D  map  that  had  been  collected  on  a  high-resolution 
Eourier  Transform  Ion  Cyclotron  Resonance  Mass  Spectrometer).  A  second  group 
included  7  peptides  that  were  found  manually  due  to  the  presence  of  other  same-sequence 
charge  variants  that  co-eluted  with  them.  The  third  group  included  three  small  non¬ 
overlapping  tryptic  peptides  of  charge  +1,  that  were  covering  parts  of  the  protein  that 
were  not  covered  by  all  previous  peptides.  Einally,  the  last  group  included  a  peptide 
which  had  to  be  fitted  manually  because  of  its  marginal  position  in  the  2-D  map,  as  it  had 
the  lowest  m/z  of  all  peptides. 

These  new  non-SEQUEST  peptide  charge  variants  increased  their  total  number  to  190 
and  also  the  sequence  cover  of  BSA  to  94.85%. 

Amino  Acid  Modifications  and  N-Terminal  Cyclizations 

In  this  study  we  were  also  interested  in  finding  out  possible  modifications.  Among  the 
172  peptides  examined,  we  focused  on  the  following  modifications,  mainly  on  S- 
carboxymethylation  of  cysteines,  on  the  oxidation  of  methionine  and  histidine,  on 
carbamylation  of  lysine  and  arginine  and  on  neutral  loss  of  ammonia  from  N-terminal 
glutamine.  These  four  modifications  were  the  ones  set  for  SEQUEST  to  search  for.  In 
addition,  the  19  non-SEQUEST  peptide  charged  variants  described  above,  include  two 
cases  where  we  observe  neutral  loss  of  ammonia  and  one  case  of  arginine  modification. 
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Out  of  the  eight  eases  representing  neutral  loss  of  ammonia,  all  of  them  correspond  to 
peptides  whose  sequences  have  the  amino  acid  glutamine  at  their  N-terminus.  This  is 
consistent  with  the  formation  of  pyroglutamate  reported  in  the  literature  (Baldwin  et  al. 
1990). 

With  regard  to  carbamylation,  5  peptides  were  observed  to  have  carbamylated  lysines, 
and  4  peptides  were  observed  to  have  carbamylated  arginines  (supplementary  table  3). 
Peptide  K*QTALVELLK  indicated  that  lysine-548  was  carbamylated.  Moreover,  lysine- 
548  is  also  known  to  be  glycated  (Wada  1996),  indicating  a  sequence  hot  spot  for  attack 
by  acidic  molecules. 

Linear  response 

We  also  wanted  to  investigate  the  range  of  linear  response  of  an  ESI  mass  spectrometer, 
such  as  the  Einnigan  LCQ™  DECA  XP-H.  Although,  the  results  pertain  to  the  particular 
instrument  used  in  this  study,  our  long-term  goal  is  to  be  able  to  use  the  BSA  tryptic 
peptide  mix  as  a  calibration  standard  that  has  to  be  run  on  the  instrument  where  the 
proteomic  study  will  occur. 

In  this  study  we  focused  on  the  five  most  abundant  peptides,  which  had  been  confirmed 
by  SEQUEST  and  had  a  charge  of  +2.  The  theoretical  number  of  data  points  for  these  5 
peptides  is  equal  to  105.  Eor  the  graphs  shown  in  Eigure  Bll,  the  abundances  of  72 
isotopic  clusters  were  used,  which  included  59  automatically  found  by  MapQuant  out  of 
which  8  had  to  replaced  by  manually  fitted  isotopic  clusters,  and  13  completely  new 
manually  fitted  isotopic  clusters. 

As  we  see  from  figure  B12,  when  we  plot  the  logarithms  of  fimoles  vs.  ion  volume  units 
for  each  peptide  charged  variant  we  see  that  the  5000  fimole  point  is  an  outlier  as  it  shows 
evidence  of  saturation,  for  this  reason  we  use  the  data  points  from  7  fmoles  to  1600 
fmoles  for  fitting.  The  curve  used  to  fit  the  data  points  mentioned  above  is  of  the  form  y 
=  Ax".  The  correlations  observed  for  this  log-log  regression  are  very  close  to  1. 
Moreover,  parameter  n,  which  represents  a  measure  of  deviation  from  linearity,  is  very 
close  to  one  for  all  peptides,  especially  for  R.RPC#f  SAETPDETYVPK.A  (1.05439)  and 
R.KVPQVSTPTLVEVSR.S  (1.07880).  However,  further  investigation  for  the  reasons  of 
the  variation  of  parameter  n  should  be  considered  in  the  future. 
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Model  for  Ionization 

Finally,  we  thought  that  one  aspect  of  mass-spectrometry  that  MapQuant  could  be  used  to 
explore,  is  modeling  the  variability  of  ionization  of  different  peptides.  The  approach  we 
took  was  to  cluster  the  190  peptide  charged  variants  into  39  non-overlapping  peptide 
clusters,  shown  by  a  double  line  in  Figure  BIO.  The  abundance  for  each  peptide  cluster 
was  calculated  as  the  sum  of  the  abundances  for  all  its  constituent  peptide  charged 
variants.  This  was  done  only  for  the  1600  fimole  data  point  since  it  is  the  most  abundant 
data  point  in  the  linear  region.  In  this  way  we  can  assume  that  the  number  of  molecules 
introduced  into  the  mass  spectrometer  per  peptide  cluster  is  equal  and  thus  allowing  us  to 
proceed  into  formulating  a  model  for  explaining  why  the  signal  acquired  per  peptide 
cluster  is  not  equal  to  each  other. 

In  this  study  we  developed  a  linear  model  to  explain  the  ionization  variation  of  each 
peptide  cluster,  where  each  amino  acid  contributes  either  negatively  or  positively  to 
ionization  efficiency.  Similar  models  have  been  applied  to  the  prediction  of  the  retention 
time  of  peptides  in  reverse  phase  columns  (Chabanet  and  Yvon,  1992).  Since  it  is  a  linear 
model,  the  fact  that  we  sum  the  abundances  of  the  constituent  peptide  charged  variants 
for  each  peptide  cluster  is  compatible  with  the  model. 

In  order  to  calculate  the  coefficients  of  ionization  for  each  amino  acid  we  use  the 
equation  9,  where  X  is  the  matrix  amino  acid  composition  for  the  39  peptide 

Y  =  XP  +  P^  (9) 

clusters  and  P  is  1  x  20  vector  holding  the  ionization  coefficient  for  each  the  twenty 
known  amino  acids.  In  the  case  of  Y,  we  assume  that  there  exists  a  maximum  ionization 
value  for  the  peptide  clusters  that  ionize  well.  For  this  reason  we  normalize  the 
abundance  of  each  peptide  cluster  by  the  maximum  and  use  that  as  the  Y  vector  in  the 
regression.  The  regression  is  performed  using  the  statistical  toolkit  in  MATLAB  and  in 
particular  with  the  function  regress.  The  correlation  of  this  regression  is  0.7967,  with  as 
p-value  of  4.8  x  10'  .  The  regression  was  also  cross  validated  using  the  leave-one-out 
method  with  the  correlation  value  of  0.5476. 

Using  the  ionization  coefficients,  we  can  see  that  for  a  significant  number  of  amino  acids, 
their  ionization  coefficients  are  positive.  Ionization  coefficients  for  methionine  and 
tryptophan  can  be  considered  the  least  reliable  since  the  BSA  sequence  contains  only  4 
and  2  respectively.  Only  six  amino  acids  have  ionization  coefficients  that  are  less  than 
0.01.  At  least  for  two  of  them,  i.e.  proline  and  phenylalanine  hypotheses  could  be 
formulated  as  to  explain  their  negative  ionization  coefficients.  Although,  the  exact 
mechanism  of  ionization  in  the  gaseous  phase  is  not  known  (Cole  2000),  the  ionization 
coefficients  of  these  two  amino  acids  point  to  an  ionization  model  which  reflects  amino 
acid  ionization  properties  that  are  known  for  the  aqueous  phase. 

More  specifically,  proline  which  has  the  second  most  negative  ionization  coefficient  is 
the  only  proteinogenic  amino  acid  that  forms  a  tertiary  amide  when  it  is  part  of  a  peptide. 
This  means  that  it  does  not  have  a  hydrogen  on  the  amide  group  and  can  therefore  not  act 
as  a  hydrogen  bond  donor.  Another  amino  acid  that  has  a  negative  ionization  coefficient 
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is  phenylalanine.  Phenylalanine  is  the  only  proteinogenic  amino  acid  whose  side  chain  is 
both  aromatic  and  completely  non-polar.  It  is  also  known  that  aromatic  amino  acids 
contribute  negatively  to  the  pKa  of  acidic  and  basic  groups,  which  reflects  ability  to  be 
protonated. 

Moreover,  no  correlation  was  found  between  amino  acid  ionization  coefficients  and 
either  isoelectric  points  or  hydrophobicity  coefficients.  As  far  as  modifications  and  N- 
terminal  cyclizations  are  concerned  we  did  not  distinguish,  for  example,  between 
glutamine  and  pyroglutamate  and  neither  between  carbamylated  and  uncarbamylated 
lysines  or  arginines.  It  is  evident,  that  these  regression  results  call  for  the  collection  of 
more  datasets  similar  to  this  one,  but  for  different  proteins,  in  order  to  improve  the 
correlation  of  the  regression  and  the  confidence  of  the  ionization  coefficients. 

Concluding  Remarks 

One  purpose  of  this  study  was  to  promote  community  sharing  and  standardization  of 
quantitative  mass  spectrometry.  We  believe  that  MapQuant  is  an  excellent  beginning  for 
the  above  goal  as  it  is  an  open-source  and  versatile  software.  An  XML  version  of  the 
OpenRaw  data  format  will  facilitate  the  exchange  of  raw  mass  spectrometry  data. 
Furthermore,  we  have  ensured  that  MapQuant  can  be  compiled  and  run  on  both  Windows 
and  Linux  platforms.  Another  advantage  of  MapQuant  is  that  it  can  process  data  acquired 
on  a  Fourier  Transform  Ion  Cyclotron  Resonance  Mass  Spectrometer  (FTICR-MS). 

In  conclusion,  our  goal  was  to  be  able  to  apply  the  techniques  used  for  BSA  to  quantitate 
proteins  in  complex  mixtures  such  as  proteomes  of  microorganisms  that  have  small 
genomes.  In  this  study  we  show  that  tryptic  peptides  from  BSA  can  behave  linearly  and 
ionize  according  to  a  linear  ionization  coefficient  model  postulated  above.  We  believe 
that  BSA  or  other  proteins  for  that  matter  can  be  used  as  standards  either  external  or 
internal  for  calibration  of  different  types  of  mass  spectrometers,  including  ones  that  use 
quadrupole  ion  traps  (QIT),  linear  traps  or  cyclotron  resonance  cells. 
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Figure  Bl.  A  detailed  look  at  an  isotopic  cluster  as  it  is  visualized  in  a  2-D  map. 

Sections  of  the  2-D  map,  such  as  relevant  mass  spectrum  and  mass  chromatogram  are 
also  shown. 
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Figure  B2.  The  format  (file  tree  structure)  used  by  MapQuaut  to  store  raw  data 
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Figure  B3.  The  pipeliue  employed  to  liuk  au  s-experimeut  with  a  q-experimeut. 

Regarding  the  q-experiment  the  program  presented  in  this  report,  MapQuant,  is  used  to 
extract  fitted  peaks  from  the  raw  data.  On  the  other  hand,  SEQUEST  with  the  help  of  the 
program  extractdta,  provides  identification  information  for  each  dta  file.  Then  all 
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m/z  m/z 


SEQUEST  output  files  are  summarized  in  one  summary  file  using  the  script 
createseqsum.pl. 
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Figure  B4.  Mapping  between  sequenced  MS2  events  and  quantitated  isotopic  clusters. 
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MS2  events,  marked  by  blue  crosses,  correspond  to  the  peptide  sequence 
K.DLGEEHFK.G,  as  shown  in  the  2-D  map  extracted  from  an  s-experiment  (lower  map). 
For  each  MS2  event  a  160-scan  x  1-m/z  search  window  (in  this  illustration,  40  scans  on 
each  side  in  the  RT  dimension  and  0.5  m/z  on  each  side  in  the  m/z  dimension)  is  drawn  in 
the  2-D  map  extracted  from  a  q-experiment  (upper  map).  Quantitated  isotopic  clusters 
using  MapQuant  are  marked  by  red  crosses  and  are  labeled  by  their  integrated  intensities. 
Any  isotopic  clusters  found  within  the  windows  defined  by  the  MS2  events  are  assigned 
to  them. 
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Figure  B5.  Illustration  of  the  definitions  surrounding  the  concept  of  a  2-D  map. 

As  seen  in  the  figure,  a  2-D  map  can  also  describe  a  fraction  of  an  experiment,  indicated 
by  the  shaded  rectangle.  A  2-D  map  is  defined  by  scan  boundaries  (e.g.  175  -  675)  and 
by  mass  bin  boundaries  (e.g.  100  -  1240).  Any  column  of  a  2-D  map  is  defined  as  a  mass 
spectrum  at  a  particular  scan,  and  any  row  is  defined  as  a  mass  chromatogram  at  a 
particular  mass  bin.  Positions  of  data  points  in  a  2-D  map  can  be  addressed  in  three 
different  ways:  (a)  Using  global  sampling  coordinates,  where  position  is  given  in  scan 
and  mass  bin  units  that  refer  to  the  experiment  as  a  whole,  (b)  using  local  sampling 
coordinates,  where  position  is  given  in  scan  and  mass  bin  units  but  using  a  local  frame  of 
reference  and  (c)  using  real  number  coordinates,  where  position  in  the  2-D  map  is 
described  in  units  of  the  physical  quantities  that  the  sampling  points  refer  to,  i.e.  minutes 
and  m/z. 
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Figure  B6.  Illustration  of  data  structures  and  concepts  required  for  the  understanding  of  the 

algorithms. 

A  segment  map  is  a  2-D  map  that  contains  all  the  data  points  belonging  to  a  segment  as 
result  of  performing  the  operation  of  watershed  segmentation  on  a  parent  2-D  map.  A 
peak  group  is  defined  as  a  eluster  of  fitted  peaks  that  can  represent  candidate  co  eluting 
isotopie  elusters.  A  peak  group  map  is  the  minimum  2-D  map  needed  for  fitting  the 
estimated  number  of  isotopic  clusters  that  a  peak  group  might  contain. 


Figure  B7.  Operation  of  watershed  segmentation  on  a  2-D  map. 
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This  algorithm  is  utilized  to  divide  the  map  in  non-overlapping  regions  so  that  fitting 
individual  peaks  becomes  less  computationally  intensive.  The  product  of  segmentation  is 
a  2-D  map  called  labeled  map  where  each  data  point  is  given  a  segment  number  which  it 
belongs  to  (indicated  by  different  shades).  The  labeled  map  can  be  used  as  a  guiding 
mask  to  extract  the  data  points  needed  for  a  particular  segment,  thus  creating  a  segment 
map  as  described  in  Figure  B5. 


'l 

1 

1 

1 

l" 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

1 

'll 

1 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

1 

1 

1 

1 

1 

1 

1 

1 

1 

N9x3E 


N9x3C 


N3x3R 


Figure  B8.  A  collection  of  different  structure  elements  used  for  different  map  operations,  such  as 

opening,  closing  and  peak  finding. 
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Figure  B9.  Schematic  way  description  the  charge  deconvolution  algorithm. 


The  algorithm’s  goal  is  to  go  through  all  peaks  that  belong  to  the  peak  group  and 
construct  a  tree  for  each  one  of  them,  which  will  represent  all  possible  equidistant  groups 
of  peaks  that  might  be  candidates  for  defining  an  isotopic  cluster.  For  building  the  first 
tree,  the  algorithm  starts  from  the  peak  with  the  lowest  m/z  (peak  A,  step  2)  and  searches 
and  picks  all  the  peaks  (peak  B  and  C)  that  in  relation  with  the  root  peak  (peak  A)  have 
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an  m/z  distance  that  is  compatible  with  a  possible  charge  state  of  an  isotopic  cluster  (+2 
for  peak  B  and  +1  for  peak  C. 
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Figure  BIO.  Tiling  of  the  observed  peptides  with  unique  sequences  on  the  mature  sequence  of  BSA. 

The  coverage  is  calculated  to  be  558  amino  acids  out  of  a  total  of  583,  which  amount  to 
94.85%.  The  possible  sites  for  trypsin  cleavage,  lysines  and  arginines  are  marked  by 
asterisks. 
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Figure  Bll.  Cumulative  iou  volume  for  each  amiuo  acid  iu  the  proteiu  (mature  BSA). 
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The  cumulative  ion  volume  for  each  amino  acid  is  calculated  as  the  sum  of  the  ion 
volumes  for  each  peptide  that  contains  that  particular  amino  acid.  The  histograms  show 
that  certain  areas  of  the  protein  are  apparently  more  ionizable  than  others.  However,  this 
variation  might  also  be  due  to  unidentified  peptides. 
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Figure  Bll  (continued) 
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Figure  B12.  Calibration  curves  for  the  five  most  abundant  +2  peptides. 

The  curves  were  generated  using  the  abundances  of  72  isotopic  clusters,  quantitated  by 
MapQuant,  either  in  a  supervised  (51)  or  an  unsupervised  manner  (21). 
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5  Metabolic  modeling  Software 

Identifying  metabolic  enzymes  with  multiple  types  of  association  evidence 

Existing  large-scale  metabolic  models  of  sequenced  organisms  commonly  include 
enzymatic  functions  which  can  not  be  attributed  to  any  gene  in  that  organism.  We  present 
a  novel  method  for  identifying  such  missing  genes  based  on  a  local  structure  of  metabolic 
network  and  multiple  types  of  functional  association  evidence,  including  clustering  of 
genes  on  the  chromosome,  similarity  of  phylogenetic  profiles,  gene  expression,  protein 
fusion  events  and  others.  Using  E.  coli  and  S.  cerevisiae  metabolic  networks,  we 
illustrate  the  predictive  ability  of  each  individual  type  of  association  evidence  and  show 
that  significantly  better  predictions  can  be  obtained  based  on  the  combination  of  all  data. 
In  this  way  our  method  is  able  to  predict  60%  of  enzyme-encoding  genes  of  E.  coli 
metabolism  within  the  top  10  (out  of  3551)  candidates  for  their  enzymatic  function,  and 
as  a  top  candidate  within  43%  of  the  cases.  Our  approach  does  not  rely  on  direct 
sequence  homology  to  known  enzyme-encoding  genes,  and  can  be  used  in  conjunction 
with  traditional  homology-based  metabolic  reconstruction  methods. 

Supplementary  materials  are  available  at: 

http://arep.med.harvard.edu/kharchenko/identification/supplements.html . 


5,1,  Motivation 

Comprehensive  and  accurate  reconstruction  of  the  metabolic  networks  remains  an 
important  problem  for  both  newly  sequenced  and  well-studied  organisms  (Borodina  et  al. 
2005;  Reed  et  al.  2003).  The  challenges  posed  by  the  experimental  determination  of  the 
metabolic  enzymes  have  led  to  development  of  computational  methods  for  metabolic 
reconstruction.  The  most  common  approach  is  to  identify  genes  encoding  a  specific 
metabolic  enzyme  by  establishing  sequence  homology  to  functionally  characterized 
enzymes  in  other  species  (Tatusov  et  al.  1996).  Although  such  sequence  homology 
methods  have  been  remarkably  successful  overall,  they  fail  to  identify  enzymes  encoded 
by  genes  with  poor  sequence  homology  to  known  metabolic  enzymes,  and  result  in 
partially  reconstructed  metabolic  networks.  The  problem  of  identifying  genes  encoded  for 
a  specific  metabolic  function  in  such  partially  reconstructed  networks  has  been  referred 
to  as  the  “missing  gene”  problem  (Osterman  and  Overbeek  2003). 

Computational  strategies  for  identifying  missing  metabolic  genes  rely  on  refined 
sequence  homology  analysis  (Green  and  Karp  2004;  Reed  et  al.  2003)  and  consideration 
of  functional  association  evidence  linking  candidate  genes  with  known  enzyme-encoding 
genes  (Osterman  and  Overbeek  2003).  For  example,  PathwayTools  hole-filler  developed 
by  Green  et  al.  (Green  and  Karp  2004),  prioritizes  candidates  obtained  from  an  initial 
sequence  homology  search  by  using,  among  other  factors,  information  on  whether  the 
candidate  gene  is  located  adjacent  to,  or  in  the  same  transcriptional  unit  as  known 
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enzyme-encoding  genes  of  related  metabolic  function.  In  some  cases,  strong  genome 
context  association  evidence,  such  as  clustering  of  genes  on  the  chromosome,  or  co¬ 
occurrence  of  genes  in  phylogenetic  lineages,  has  played  a  key  role  in  identifying 
metabolic  genes  in  several  organisms  (Bishop  et  al.  2002;  Bobik  and  Rasche  2001). 

An  extensive  set  of  tools  has  been  developed  to  detect  and  catalog  general  pair-wise 
functional  associations  between  genes  based  on  a  combination  of  genome  context 
methods  and  other  evidence,  such  as  co-expression  or  protein  interactions  (Bowers  et  al. 
2004;  von  Mering  et  al.  2003).  Combinations  of  heterogeneous  association  evidence  have 
been  used  for  general  functional  inference  (Troyanskaya  et  al.  2003),  prediction  of 
protein  complexes  (Asthana  et  al.  2004;  Jansen  et  al.  2003;  Yamanishi  et  al.  2004)  and 
synthetic  lethal  interactions  (Wong  et  al.  2004).  A  recent  work  by  Yamanishi  et  al. 
(Yamanishi  et  al.  2005)  relied  on  a  combination  of  genomic,  mRNA  expression  and 
localization  evidence,  together  with  information  on  chemical  compatibility  to  reconstruct 
metabolic  pathways  from  known  metabolic  enzymes. 

In  an  earlier  study  we  described  a  method  for  identifying  missing  enzyme-encoding 
genes  based  on  gene  co-expression  and  local  structure  of  metabolic  network  (Kharchenko 
et  al.  2004).  The  candidate  genes  for  encoding  a  missing  metabolic  enzyme  were 
evaluated  based  on  the  overall  similarity  of  their  expression  profile  with  the  expression  of 
the  metabolic  network  neighborhood  of  the  missing  enzyme  (Figure  Cl  a).  The  local 
property  of  gene  co-expression,  which  formed  the  basis  of  this  method,  was  also 
observed  for  other  types  of  functional  associations,  in  particular  for  associations 
established  by  genome  context  (Kharchenko  et  al.  2005).  In  this  work  we  showed  that 
such  an  approach  can  be  extended  to  identify  metabolic  enzyme-encoding  genes  from  a 
number  of  different  types  of  functional  association  evidence,  including  phylogenetic 
profile  co-occurrence,  physical  clustering  of  genes  on  the  chromosome  and  protein 
interaction  data.  We  noted  that  the  presented  method  does  not  rely  on  sequence 
homology  to  known  enzymes,  and  its  predictions  are  complementary  to  the  traditional 
methods  of  metabolic  reconstruction. 

We  illustrated  the  performance  of  each  individual  type  of  association  evidence  by  testing 
how  well  the  method  is  able  to  predict  known  enzyme-encoding  genes  of  E.  coli  (Reed  et 
al.  2003)  and  S.  cerevisiae  (Forster  et  al.  2003)  metabolic  models  (see  Methods).  A  set  of 
candidate  genes,  containing  all  non-metabolic  genes  in  an  organism,  is  evaluated  and 
prioritized  by  calculating  overall  association  with  the  neighborhood  of  the  missing 
metabolic  enzyme  (Figure  Clb).  To  assess  the  performance  of  our  method  we  relied  on  a 
self-rank  measure,  which  is  the  rank  of  a  known  enzyme-encoding  gene  among  the  set  of 
candidates  prioritized  for  its  own  metabolic  function  (see  Methods).  We  developed 
techniques  for  combining  multiple  types  of  association  evidence  and  show  that 
significantly  better  prediction  performance  can  be  achieved  based  on  combined 
association  evidence. 
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5,2,  Results  and  discussion 


Similarity  of  phylogenetic  profiles 

A  number  of  earlier  studies  have  explored  using  patterns  of  gene  co-oecurrence  or 
absence  in  the  phylogenetic  lineages  to  infer  functional  association  between  gene  pairs 
(Huynen  and  Bork  1998;  Pellegrini  et  al.  1999).  The  basic  premise  of  the  method  is  that  a 
function  is  likely  to  be  encoded  by  several  associated  genes;  therefore  lineages 
maintaining  only  some  of  these  genes  will  have  lower  evolutionary  fitness.  For  instance, 
enzymes  catalyzing  successive  steps  of  a  linear  metabolic  pathway  are  likely  to  be 
present  together  in  an  organism  relying  on  that  metabolic  pathway,  and  absent  together 
from  an  organism  that  does  not  require  that  pathway. 

A  phylogenetic  profile  of  a  given  gene  on  a  set  of  Ng  genomes  can  be  encoded  as  binary 
string  of  length  Ag,  with  each  position  marking  presence  (1)  or  absence  (0)  of  an 

ortholog  in  the  corresponding  genome.  Functional  association  between  a  pair  of  genes  is 
assessed  by  the  degree  of  similarity  of  their  phylogenetic  profiles.  A  number  of  different 
distance  measures  have  been  used  to  calculate  such  similarity,  including  Hamming  string 
distance,  mutual  information  and  hypergeometric  distribution  (Bowers  et  al.  2004; 
Pellegrini  et  al.  1999,  Huynen,  2000  #546;  Wu  et  al.  2003).  We  find  that  the  performance 
of  different  distance  measures  is  very  similar.  These  profile  similarity  measures  do  not 
take  into  account  variable  degree  of  divergence  between  genomes  comprising  the 
orthology  dataset.  This  is  particularly  clear  in  the  case  of  Hypergeometric  distribution 
measure  (Bowers  et  al.  2004;  Wu  et  al.  2003),  which  assumes  that  ortholog  occurrences 
are  independently  and  identically  distributed  across  the  set  of  included  genomes  (see  5.3 
Methods). 

The  identity  assumption  would  suggest  that  the  total  number  of  ortholog  occurrences 
within  each  genome  should  be  approximately  the  same,  and  the  distribution  of  the 
number  of  orthologs  should  form  a  single,  narrow  peak  around  an  average  ortholog 
number.  The  empirical  distribution,  however,  is  quite  different  from  the  expected  form, 
lacking  a  peak  around  the  mean,  and  showing  substantial  density  over  almost  an  entire 
range  of  ortholog  numbers.  When  the  identity  assumption  is  relaxed,  profile  similarity 
probability  is  described  by  the  Extended  Multivariable  Hypergeometric  distribution 
(Harkness  1965).  Because  probability  functions  of  this  distribution  have  not  been  derived 
in  a  closed  form,  we  developed  a  numerical  algorithm  for  estimating  these  probabilities 
(see  5.3  Methods). 

Bias  stemming  from  the  violation  of  the  independence  assumption  can  be  minimized  by 
exclusion  or  reduction  of  closely  related  species  in  the  ortholog  occurrence  dataset.  We 
developed  an  approach  similar  to  previously  published  work  (von  Mering  et  al.  2003), 
which  reduces  the  bias  by  folding  together  phylogenetic  branches  containing  closely 
related  species,  and  using  an  ortholog  occurrence  pattern  based  on  the  agreement  within 
the  folded  branch  (see  5.3  Methods). 
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The  effect  of  both  corrections  on  the  ability  to  predict  enzyme-encoding  genes  in  E.  coli 
is  illustrated  by  the  cumulative  self-rank  distributions  (see  5.3  Methods)  in  Figure  C2a. 
The  extended  hypergeometric  distribution  correction  for  the  variable  genome  divergence 
from  E.  coli  target  genome  (violation  of  the  identity  assumption)  provides  a  noticeable 
improvement  in  prediction  performance  (8%  at  self-rank  threshold  of  50).  On  the  other 
hand,  the  folding  method  correcting  for  variable  divergence  with  the  set  of  query 
genomes  (violation  of  independence  assumption)  does  not  significantly  improve  the 
results. 

The  phylogenetic  profile  co-occurrence  method  depends  on  identification  of  orthologous 
genes  across  potentially  diverse  lineages.  Existing  investigations  have  used  a  variety  of 
methods,  including  readily  available  Clusters  of  Orthologous  Groups  (COG)  database 
(Tatusov  et  al.  2001),  (von  Mering  et  al.  2003;  Wu  et  al.  2003),  closest  homologs 
(Bowers  et  al.  2004),  and  best  bi-directional  homology  pairs  (Huynen  et  al.  2000).  The 
results  presented  in  our  work  rely  on  two  alternative  sets  of  orthology  data.  The  first  set 
comes  from  KEGG  SSDB  database  (Itoh  et  al.  2004),  and  includes  closest  homologs  and 
best  bi-directional  hits  as  determined  by  the  Smith  and  Waterman  algorithm  (we  will 
refer  to  it  as  KEGG-based  dataset).  The  second  set  was  constructed  based  on  results  of 
BLAST  (Altschul  et  al.  1990)  queries  against  a  “non-redundant”  set  of  known  protein 
sequences  maintained  by  NCBI  (see  Methods).  The  set  also  includes  information  on 
reverse  BLAST  searches  to  determine  best  bi-directional  hits  (referred  to  as  BLAST- 
based  dataset). 

Predictive  performance  of  different  orthology  datasets  is  compared  in  Eigure  C2b.  We 
note  that  coverage  of  the  COG  orthology  data  is  biased  towards  genes  encoding  known 
metabolic  enzymes,  and  the  self-rank  performance  of  this  dataset  was  estimated  by 
normalizing  with  respect  to  the  non-metabolic  gene  coverage.  Eigure  C2b  shows  that 
profile  associations  calculated  using  BLAST-based  dataset  provide  better  predictions  of 
enzyme-encoding  genes  than  association  based  on  the  KEGG  orthology  dataset.  We  also 
find  that  in  the  case  of  both  datasets  better  performance  is  attained  when  using  best  bi¬ 
directional  homology  pairs  instead  of  closest  homologs. 

As  a  consequence  of  gene  duplications,  metabolism  contains  a  significant  number  of 
paralogous  enzyme  pairs  (Maltsev  et  al.  2005).  In  many  cases,  such  enzymes  continue  to 
catalyze  the  same  reactions.  Such  pairs  will  frequently  have  similar  or  identical  orthology 
mappings,  and  their  inclusion  can  lead  to  a  significant  bias  in  estimation  of  the  predictive 
performance.  The  results  presented  in  this  work,  therefore,  exclude  self-ranks  of  any 
metabolic  enzymes  that  have  high  sequence  homology  to  any  other  metabolic  enzyme  in 
the  organism  (see  5.3  Methods). 

Co-expression  of  orthologous  genes 

The  approach  for  identifying  enzyme-encoding  genes  based  on  the  similarity  of  mRNA 
expression  profiles  (Kharchenko  et  al.  2004)  can  be  extended  to  include  co-expression 
information  of  orthologous  genes  in  other  organisms.  Conservation  of  mRNA  co¬ 
expression  across  different  species  has  been  investigated  by  a  number  of  recent  studies 
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(Bergmann  et  al.  2004;  Snel  et  al.  2004;  Teichmann  and  Babu  2002;  van  Noort  et  al. 
2003).  For  example,  analysis  of  eo-expressed  gene  pairs  between  S.  cerevisiae  and  C. 
elegans  shows  statistieally  signifieant  (P  value  <  10'  )  level  of  eonservation  (van  Noort  et 
al.  2003).  Although  the  number  of  pairs  with  highly  eonserved  eo-expression  is  small, 
incorporating  ortholog  co-expression  can  provide  significant  improvements  to  the 
accuracy  of  functional  predictions  based  on  the  mRNA  expression  data  (Teichmann  and 
Babu  2002;  van  Noort  et  al.  2003). 

We  find  that  enzyme-encoding  gene  predictions  based  on  the  co-expression  of  E.  coli 
orthologs  in  S.  cerevisiae  achieve  good  performance  on  the  enzymes  covered  by  the 
dataset.  Although  S.  cerevisiae  orthologs  can  be  identified  for  only  40.1%  of  E.  coli 
metabolic  genes,  combining  native  and  ortholog  co-expression  scores  provides  noticeable 
improvements.  Combination  of  native  and  ortholog  co-expression  increases  the  fraction 
of  metabolic  enzymes  predicted  within  the  top  50  candidates  from  27%  to  36%. 
Similarly,  using  E.  coli  expression  data  improves  prediction  results  for  enzyme-encoding 
genes  of  S.  cerevisiae  metabolic  network.  Overall  self-rank  performance  based  on 
combined  co-expression  data  is  included  in  Figure  C4. 

Clustering  of  genes  on  the  chromosome 

Relative  positions  of  genes  on  the  chromosome  have  also  been  successfully  used  to  infer 
functional  associations.  Most  notably,  analysis  of  prokaryotic  genomes  focused  on 
identifying  pairs  of  orthologs  located  close  to  each  other  on  the  chromosome,  as  well  as 
sets  of  such  pairs  (Overbeek  et  al.  1999;  von  Mering  et  al.  2003;  Yanai  et  al.  2002).  Such 
clustering  is  also  observed  in  the  eukaryotic  genomes,  even  though  they  lack  well- 
defined  operon  structures.  A  recent  study  by  Lee  et  al.  (Lee  and  Sonnhammer  2003) 
analyzed  clustering  of  genes  in  KEGG  pathways  for  5  distant  eukaryotic  species.  The 
study  demonstrated  that  depending  on  the  genome,  30%  to  98%  of  the  pathways  exhibit 
statistically  significant  levels  of  gene  clustering  on  the  chromosome.  A  variety  of 
methods  have  been  developed  for  identifying  chromosome  gene  clusters  and  evaluating 
their  significance  (Durand  and  Sankoff  2003).  To  generate  association  scores  we  use  a 
simple  statistical  evaluation  strategy  based  on  the  chromosome  gene  order,  which  allows 
for  computationally  efficient  treatment  of  large  number  of  genomes  (see  Methods). 

The  self-rank  performance  based  on  the  chromosome  clustering  association  is  shown  in 
Figure  C4.  The  overall  performance  for  known  E.  coli  metabolic  enzymes  is  better  than 
for  the  S.  cerevisiae  enzymes,  which  is  expected  given  the  prominent  role  of  operons  in 
prokaryotic  transcriptional  regulation. 

Other  association  measures 

Interacting  proteins  encoded  by  separate  genes  in  some  species,  may  sometimes  occur  as 
a  single,  multi-domain  fusion  protein  in  other  species.  Detecting  fusion  of  non- 
homologous  proteins  in  another  organism  has  been  shown  to  be  a  significant  predictor  of 
functional  association  between  genes  (Enright  et  al.  1999;  Marcotte  et  al.  1999;  Yanai  et 
al.  2001).  Our  calculations  of  a  fusion  association  score  are  based  on  a  combination  of 
fusions  detected  at  several  sequence  homology  thresholds.  The  overall  performance  of 
the  method  is  included  in  Eigure  C4.  Although  protein  fusion  associations  are  only  able 
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to  predict  relatively  small  fraction  of  enzyme-encoding  genes  (18%  for  E.  coli),  almost 
all  of  predicted  enzymes  are  returned  within  the  top  20  candidates. 

A  number  of  metabolic  reactions  are  catalyzed  by  well  established  protein  complexes, 
such  as  the  phosphofructokinase  complex.  Furthermore,  metabolic  processes  commonly 
involve  interactions  between  multiple  metabolic  enzymes.  For  instance,  the 
phosphofructokinase  alpha  subunit  encoded  by  Pfkl  also  interacts  with  a  product  of 
Fbal,  fructose-biphosphate  adolase  II,  catalyzing  an  adjacent  reaction  in  the  glycolysis 
pathway  (Matic  et  al.  2001).  Large  protein-protein  interaction  datasets  have  been 
generated  by  studies  using  yeast  two-hybrid  systems  (Ito  et  al.  2000;  Uetz  et  al.  2000) 
and,  more  recently,  mass  spectrometry-based  techniques  (Gavin  et  al.  2002;  Ho  et  al. 
2002).  In  the  framework  of  our  approach,  candidate  genes  can  be  evaluated  by  assessing 
the  overall  amount  of  interactions  between  a  candidate  gene  and  the  metabolic  network 
neighborhood  of  a  missing  enzyme.  To  assess  confidence  of  individual  interactions,  our 
analysis  makes  use  of  the  probabilistic  protein  interaction  dataset  from  Jansen  et  al. 
(Jansen  et  al.  2003),  which  combines  results  of  four  high-throughput  interaction  datasets 
(Gavin  et  al.  2002;  Ho  et  al.  2002;  Ito  et  al.  2000;  Uetz  et  al.  2000)  .  The  performance  of 
our  prediction  method  on  the  protein  interaction  data  is  significantly  lower  than  that  of 
other  association  scores,  nevertheless  it  is  above  what  is  expected  from  a  random 
association  score  (Figure  C4b). 

Functional  association  can  also  be  assessed  through  similarity  of  deletion  mutant 
phenotypes  under  a  large  set  of  environmental  conditions.  For  example,  deletions  of 
genes  that  are  adjacent  to  each  other  in  a  linear  metabolic  pathway  are  likely  to  result  in 
identical  mutant  phenotypes.  A  recent  work  by  Dudley  et  al.  (Dudley  et  al.  2005) 
experimentally  measured  growth  phenotypes  of  4710  S.  cerevisiae  mutants  under  21 
experimental  conditions,  including  different  carbon  sources,  nutrient  limitations,  stress 
and  others  conditions.  We  tested  the  performance  of  our  prediction  algorithm  on  a  set  of 
53  known  metabolic  enzyme-encoding  genes  for  which  high-confidence  data  was 
available.  While  the  results  illustrate  predictive  power  of  phenotypic  profde  associations, 
overall  contribution  of  this  score  to  the  predictions  of  unidentified  enzyme-encoding 
genes  is  very  small.  This  is  expected,  because  available  high-confidence  phenotypic  data 
covers  only  14%  of  S.  cerevisiae  genes. 

Overall  enhancements  of  the  individual  association  scores 

Description  of  a  metabolic  network  neighborhood  can  be  enhanced  by  considering 
relative  strength  of  metabolic  connections  established  by  different  metabolites. 
Metabolites  connecting  many  enzyme-encoding  genes  pairs  establish,  on  average,  weaker 
functional  associations  (Kharchenko  et  al.  2005).  The  performance  of  our  predictive 
method  can  be  improved  by  weighting  the  contribution  of  each  neighbor  in  evaluating  the 
overall  association  of  a  candidate  gene  with  the  metabolic  network  neighborhood  of  a 
missing  enzyme.  The  weight  is  assigned  according  to  the  total  number  of  enzyme  pairs 
associated  with  a  connecting  metabolite  (see  5.3  Methods). 

Distributions  of  association  scores  between  a  given  gene  and  all  other  genes  in  an 
organism  tend  to  differ  from  one  gene  to  another.  For  instance,  a  gene  whose  orthologs 
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can  be  identified  in  many  organisms  will  typieally  have  more  high-eonfidenee 
chromosome  elustering  assoeiations  than  a  gene  with  relatively  few  deteeted  orthologs. 
This  introduees  bias  when  evaluating  overall  assoeiation  with  a  metabolie  network 
neighborhood.  The  assoeiation-rank  resealing  reduees  this  bias  by  translating  raw 
assoeiation  seores  into  probabilities  of  metabolie  adjaeeney,  ealeulated  based  on  the  rank 
of  raw  assoeiation  seore  within  a  distribution  of  all  scores  for  a  partieular  gene.  The 
resealing  proeedure  also  reduees  the  number  of  false  positives  by  eonsidering  raw 
assoeiation  seore  of  a  gene  pair  with  respect  to  organism-wide  score  distributions  of  both 
genes  and  ehoosing  a  more  eonservative  adjaeeney  probability  value. 

The  predietive  performanee  of  all  assoeiation  seores  is  improved  by  either  eorreetion, 
with  the  exeeption  of  the  protein  fusion  seore,  where  applieation  of  metabolite  weighting 
results  in  weaker  performanee. 

Predictions  based  on  combined  association  evidence 

Enzyme-encoding  gene  predictions  based  on  the  individual  association  scores  can  be 
combined  to  achieve  better  performance.  Normalizing  relative  strength  of  different 
association  scores  requires  informative  priors.  Such  priors  can  be  either  constructed 
manually,  for  example  by  consulting  experts  (Troyanskaya  et  al.  2003),  or  learned  from 
known  test-cases.  This  problem  has  been  extensively  considered  with  respect  to 
confidence  in  pair-wise  gene  functional  associations,  and  test  cases  for  learning  the  priors 
were  based  on  known  functional  groupings,  such  as  GO  annotations  (Lee  et  al.  2004)  or 
membership  in  KEGG  pathways  (von  Mering  et  al.  2003).  Eor  the  current  problem  of 
prioritizing  enzyme-encoding  gene  candidates,  such  priors  can  be  learned  from  known 
enzyme-encoding  genes  (Green  and  Karp  2004). 

Towards  the  goal  of  integrating  multiple  types  of  association  evidence,  we  have 
developed  two  distinct  methods.  The  first  approach  is  based  on  a  direct  likelihood-ratio 
(DLR)  evaluation  of  the  association  score  probability  distributions.  The  likelihood  that  a 
given  candidate  gene  encodes  the  desired  metabolic  enzyme  is  calculated  under  the 
simplifying  assumptions  that  individual  association  scores  are  independent  and 
mono  tonic.  The  mono  tonic  assumption  states  that  for  every  association  score,  the 
likelihood  of  association  increases  monotonically  with  the  absolute  value  of  the  score. 
Both  assumptions  allow  for  useful  approximations,  but  in  general  can  be  shown  to  be 
incorrect.  Eor  example,  clustering  of  genes  on  the  chromosomes  in  E.  coli  is  statistically 
significantly  correlated  with  the  similarity  in  expression  profiles  (Spearman  rank 
correlation  P  value  <  10’'°),  violating  the  independence  assumption.  The  DLR  method 
calculates  overall  likelihood  ratio  of  a  candidate  gene  encoding  the  desired  enzyme  as  a 
product  of  likelihood  ratios  for  each  individual  association  score  (see  Methods). 

The  second  approach  uses  a  general  machine  learning  method  called  Adaboost  (Ereund 
and  Schapire  1997;  Schapire  2002),  and  does  not  rely  on  independence  or  monotonicity 
of  the  association  scores.  The  generated  classifiers  are  in  the  form  of  alternating  decision 
trees  (ADT),  which  are  generalization  of  decision  stumps,  decision  trees,  and  their 
combination  (Ereund  and  Mason  1999).  In  addition  to  flexible  semantic  representation, 
ADT-based  classifiers  provide  a  real-valued  measure  of  confidence,  called  classification 
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margm,  which  can  be  related  to  the  probability  of  a  given  elassifieation  being  eorreet 
(Sehapire  et  al.  1997).  The  Adaboost  method  has  been  sueeessfully  applied  to  several 
large-seale  biologieal  problems,  ineluding  deteetion  of  transeription  faetor  binding  motifs 
and  predietion  of  regulatory  response  (Middendorf  et  al.  2005;  Middendorf  et  al.  2004). 

We  find  that  in  identifying  missing  metabolie  genes  both  ADT  and  DLR  methods  aehieve 
eomparable  levels  of  performanee  (Figure  C3).  The  ADT  method  performs  slightly  better 
on  E.  coli  metabolie  enzymes  and  DLR  on  S.  cerevisiae.  Sueeess  of  the  DLR  method 
relative  to  a  general  elassifier,  sueh  as  ADT,  suggests  that  the  derived  assoeiation  seores 
are  largely  eonsistent  with  the  underlying  assumptions  of  monotonieity  and 
independenee,  and  allow  quality  predietions  to  be  made  based  on  a  straightforward 
evaluation  of  the  seore  probability  distributions.  The  ADT  method,  however,  does  not 
require  such  assumptions,  and  may  be  used  to  ineorporate  in  the  future  a  wide  variety  of 
unrestricted  descriptors,  such  as  sequence  homology  data  or  expression  variability 
(Kharehenko  et  al.  2004). 

Predietion  performanee  of  individual  functional  association  scores  and  their  eombination 
using  ADT  method  is  shown  for  E.  coli  metabolie  enzymes  in  Figure  4a.  The  figure 
illustrates  that  predietions  based  on  the  eombined  evidenee  are  elearly  superior  to  what  is 
aehieved  by  any  individual  type  of  funetional  assoeiation  evidenee,  with  43%  of  known 
enzymes  predieted  as  number  one  eandidates  for  their  enzymatie  funetion,  and  60% 
within  the  top  10  eandidates.  Assoeiations  based  on  the  ehromosome  elustering  provide 
the  best  predietions  of  any  single  evidence  type,  and  are  able  to  prediet  almost  half  of  the 
metabolic  enzymes  within  the  top  10  eandidates.  It  is  also  important  to  note  that  different 
assoeiation  evidenee  types  are  not  redundant  -  none  of  the  predictions  based  on  a 
partieular  assoeiation  seore  are  eompletely  eovered  by  the  predietions  of  another 
assoeiation  seore. 

Individual  and  eombined  predietion  performanee  for  enzymes  of  S.  cerevisiae  metabolie 
network  is  illustrated  in  Figure  4b.  Relative  to  E.  coli  predictions,  co-expression  seore  in 
S.  cerevisiae  tends  to  perform  better;  however  ehromosome  clustering  and  phylogenetic 
profile  assoeiation  seores  perform  worse.  The  overall  level  of  performance  is  also  lower, 
with  approximately  60%  of  the  enzymes  predicted  within  top  50  eandidates  (eompared  to 
71%  'm  E.  coli).  The  performanee  differenee  ean  be  partially  attributed  to  lower  number 
of  eandidate  genes  in  E.  coli  (3351  as  opposed  to  5252  in  S.  cerevisiae)  and  wider 
availability  of  the  genomie  data  for  baeterial  organisms.  For  example,  ehromosome 
elustering  assoeiations  were  ealculated  on  a  dataset  that  eontains  nearly  a  hundred 
bacterial  species  and  only  a  handful  of  eukaryotie  genomes. 
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5,3,  Methods,  Assumptions,  and  Procedures 
Metabolic  neighborhoods  and  network  representation 

The  metabolic  network  was  represented  as  a  graph,  with  nodes  corresponding  to 
metabolic  enzyme-encoding  genes  and  edges  to  connections  established  by  the  metabolic 
reactions  (Kharchenko  et  al.  2005).  Two  metabolic  genes  are  connected  if  the  enzymes 
they  encode  share  a  metabolite  among  the  set  of  reactants  or  products  of  the  reactions 
they  catalyze.  Metabolic  network  distance  between  enzyme-encoding  genes  is  calculated 
as  a  shortest  path  in  the  graph.  Distance  of  directly  connected  genes  is  taken  to  be  1 .  A 
metabolic  neighborhood  layer  of  a  radius  R  around  a  metabolic  enzyme  X  is  defined  as 
a  set  of  all  enzyme-encoding  genes  that  are  at  the  distance  R  from  the  enzyme  X .  A 
metabolic  neighborhood  of  radius  i?  is  a  set  of  neighborhood  layers  of  radii  r  <R 
(Figure  Cl  a). 

Detailed  metabolic  models  of  E.  coli  (Reed  et  al.  2003)  and  S.  cerevisiae  (Forster  et  al. 
2003)  were  used  to  compile  comprehensive  connectivity  graphs  for  these  organisms, 
excluding  metabolic  connections  established  by  the  following  top  14  most  common 
metabolites:  ATP,  ADP,  AMP,  CO2,  CoA,  glutamate,  H,  NAD,  NADH,  NADP,  NADPH, 
NH3,  orthophosphate  and  pyrophosphate  (and  corresponding  mitochondrial  and  external 
species). 

Self-rank  validation 

To  assess  performance  of  our  method  we  use  self-rank  measure,  which  quantifies  the 
ability  to  predict  known  metabolic  enzymes.  A  self-rank  of  a  known  enzyme-encoding 
gene  is  defined  as  a  rank  of  that  gene  among  a  set  of  candidates  in  an  ordering 
determined  by  our  algorithm  (Figure  Clb).  A  set  of  candidates  consist  of  all  genes  in  the 
organism  that  do  not  already  appear  in  the  metabolic  graph  (i.e.  non-metabolic  genes) 
and  the  known  enzyme-encoding  gene  that  is  being  tested.  A  candidate  set  for  E.  coli 
contained  3351  open  reading  frames  (ORFs),  and  for  S.  cerevisiae  5252  ORFs.  A  perfect 
prediction  algorithm  would  result  in  a  self-rank  of  1  (top  candidate)  for  every  metabolic 
enzyme,  and  a  completely  non-informative  method  would  result  in  a  uniform  distribution 
of  ranks  (on  the  range  from  1  to  the  size  of  the  candidate  set). 


The  overall  performance  of  the  method  was  measured  by  evaluating  self-ranks  of  a  set  of 
known  enzyme-encoding  genes.  This  set  contains  all  known  metabolic  enzymes  in  an 
organism,  except  for  the  enzymes  that  have  high  sequence  homology  (BLASTp  E  value 
below  10  '°)  to  some  other  known  metabolic  enzyme  in  that  organism  (paralogs).  The 
exclusion  of  such  paralogous  pairs  aims  to  avoid  bias  stemming  from  overlapping 
ortholog  mappings.  The  resulting  set  contained  351  enzymes  from  E.  coli  metabolism, 
and  240  from  S.  cerevisiae. 

Orthology  datasets 

The  KEGG  ortholog  dataset  was  retrieved  from  Sequence  Similarity  Data 
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Base  (SSDB)  (01/2005).  All  available  closest  homologs  and  best  bi-directional  hits  of  E. 
coli  and  S.  cerevisiae  genes  were  recorded.  The  BLAST-based  dataset  was  constructed 
using  BLASTp  queries  against  NCBl  NR  protein  dataset  (03/2005),  using  E-value  cutoff 
of  10'  and  limiting  the  maximum  number  of  homologs  per  query  to  6000.  To  determine 
best  bi-directional  hits,  reverse  BLASTp  queries  were  run  for  every  hit  against  target 
genome  {E.  coli  or  S.  cerevisiae).  NCBI  taxonomy  identifiers  were  used  to  group  hits 
belonging  to  the  same  organism.  For  E.  coli  only  organisms  containing  orthologs  to  more 
than  4%  of  genes  were  considered  (7%  for  S.  cerevisiae).  We  found  that  performance  of 
analogous  datasets  constructed  using  TBLASTN  queries  was  similar. 

Phylogenetic  profile  co-occurrence 

Given  a  set  of  genomes  G  =  |,  a  phylogenetic  profile  of  a  gene  was  represented 

as  a  binary  vector  p  of  length  Nq  ,  such  that  p^  =1  if  an  orthologous  gene  is  present  in 
genome  G. ,  and  p.  =  0  otherwise. 


Assuming  that  orthologs  are  independently  identically  distributed  (IID)  within  each 
genome  G, ,  the  probability  of  observing  two  profiles  of  a  given  similarity  under  the  null 


hypothesis  is  calculated  using  hypergeometric  distribution  (Wu  et  al.  2003): 


p[k\n,m,N)  = 


(n^ 

'N-n 

jn  -  k  ^ 

rN'^ 


where  k  is  the  number  of  ortholog  co-occurrences,  N  is  the  size  of  the  genome  set  G  ,  n 
and  m  correspond  to  the  number  of  orthologs  in  the  two  profiles  being  compared.  The 
probability  of  functional  association  is  then  given  by  Passodation  =  l-Y,P(k\n,m,N), 


k>K 


where  K  is  the  number  of  actual  ortholog  co-occurrences  observed  between  two  specific 
profiles  (Bowers  et  al.  2004). 


If  the  assumption  of  identical  ortholog  distribution  within  each  genome  is  relaxed, 
probability  p[k\n,m,N)  is  distributed  as  a  sum  of  independent,  non-identical  Bernoulli 

variables  x, :  A:  ~  ^  x,  ,  with  /»(x, )  corresponding  to  the  probability  of  observing  a 

min(«,A«) 

match  in  a  given  genome  i.  This  is  a  special  case  of  the  Extended  Multivariable 
Hypergeometric  distribution  (Harkness  1965). 


To  determine  the  probability  of  observing  k  ortholog  co-occurrences  between  profiles  of 
a  given  gene  x  and  some  other  geney ,  p[k\n,m,N),  we  calculate  Pj^[k\n,m)  given  by 

the  recursive  formula  below.  In  general,  P.{k'\n',m')  is  the  probability  of  observing  k' 
ortholog  co-occurrences  at  a  current  (G,  )  or  subsequent  genomes  as  we  walk  along  a 
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predefined  genome  order,  where  n' and  m'  is  the  number  of  orthologs  of  gene  x  and 
gene  y  respeetively  in  genomes  Gj  such  that  j<i.  p^y,  m')  is  defined  recursively  as: 

P.  (ky,  m)  =  pI^.  [p‘:P,_,  [k'  -  \\n  - 1,  m'  - 1)+  (l  -  p\  (ky,  m  - 1)]+  (l  -  pl^.  [ky ,  m) 
where  i  is  the  genome  index,  p°^,  is  the  probability  of  one  of  the  remaining  m' 
orthologs  of  gene  y  occurring  in  genome  G. ,  and  p.  is  the  probability  of  ortholog  co¬ 
occurrence  in  the  genome  G. . 


Given  the  values  of  probabilities  and  p%  the  value  of  Pyy,m')  is  computed 
using  a  dynamic  programming  approach,  p.  is  equal  to  0  or  1 ,  depending  on  whether  an 
ortholog  of  gene  x  is  present  in  genome  G. .  The  consideration  of  non-identical 
distribution  of  ortholog  frequency  within  each  genome  is  then  localized  to  p°^, ,  which  in 

this  case  is  distributed  according  to  the  marginal  Extended  Hypergeometric  distribution. 
The  marginal  form  of  the  distribution  is  more  amenable  to  the  computational 
approximations  than  the  regular  form.  Since  p^^,  does  not  depend  on  the  choice  of  genes 

X  and  y ,  we  sample  p°^,  computationally,  taking  into  account  individual  ortholog 
occurrence  frequencies  of  each  genome.  The  probability  of  ortholog  occurrence  in  a 
specific  genome  (py  )  was  sampled  computationally  by  drawing  from  the  set  of 

organisms  without  replacement  with  relative  probabilities  corresponding  to  the  rate  of 
ortholog  occurrences  in  each  genome.  In  each  iteration,  draws  were  performed  until  all  of 
the  organisms  were  drawn.  A  total  of  10^  such  iterations  were  performed. 


To  correct  for  non-independent  ortholog  occurrence  rates,  we  first  evaluate  the  distance 

MI{X,Y) 


between  a  pair  of  query  genomes  X  and  Y  as  d(X,Y)  = 


mm{HiX),H{Y)) 


,  where 


MI{X,Y)  is  mutual  information  between  ortholog  occurrence  vectors  for  genomes  X 
and  Y ,  and  H{)  is  Shannon  entropy  of  each  vector.  The  ortholog  occurrence  vector  for  a 
query  genome  X  is  a  binary  vector  of  length  Ngems  (number  of  genes  in  a  target  organism, 
i.e.  E.  coli),  such  that  the  value  of  the  1°^  element  is  1  if  ortholog  of  an  gene  is  found  in 
X,  and  0  otherwise.  Clusters  of  closely  related  organisms  ( J  <  0.8)  were  determined  by 
the  greedy  neighbor-joining  method.  Several  ways  of  summarizing  the  ortholog  co¬ 
occurrence  vector  for  a  cluster  of  closely  related  organisms  were  tested:  selecting  the 
organism  with  highest  entropy,  using  AND/OR  functions,  and  using  majority  rule.  We 
find  that  performance  of  AND  function  is  optimal  for  the  threshold  of  J  <  0.8 ,  however 
for  higher  thresholds  selecting  an  organism  with  highest  entropy  results  in  better 
performance. 


In  evaluating  performance  without  adjacency-rank  rescaling  (Figure  C2),  total 
phylogenetic  profile  association  score  between  a  candidate  gene  x  and  a  metabolic 

neighborhood  layer  L  was  calculated  as  score (x)  =  V  — — ,  where  P()  is  the 

T^iPixg) 
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probability  of  observing  a  given  number  of  ortholog  co-oecurrences  ealculated  using  one 
of  described  probability  distributions. 

To  estimate  self-rank  performance  of  the  COG  dataset  correcting  for  the  bias  in  orthology 
dataset  coverage  (Figure  C2b),  the  fraction  of  true  enzyme-encoding  genes,  /  predicted 
within  a  particular  self-rank  threshold  t  was  calculated  as  f{t)  =  af^f'{a^t),  where 
is  the  fraction  of  test  metabolic  enzyme-encoding  genes  covered  by  the  COG  dataset, 
is  the  fraction  of  candidate  set  genes  (non-metabolic)  covered  by  the  dataset,  and /'  is 
the  performance  on  the  set  of  metabolic  and  candidate  genes  covered  by  the  COG  dataset. 

Gene  co-expression 

The  co-expression  association  value  was  calculated  as  a  Spearman  rank  correlation  (Press 
et  al.  2002)  between  expression  profdes.  E.  coli  co-expression  was  calculated  based  on 
the  180  conditions  from  the  Stanford  Microarray  Database  (Sherlock  et  al.  2001).  S. 
cerevisiae  co-expression  was  measured  based  on  the  mRNA  expression  profiles  from 
Rosetta  “compendium”  dataset  (Hughes  et  al.  2000).  Log  10  intensity  ratio  data  was  used. 
Co-expression  of  orthologous  genes  was  determined  using  the  KEGG  ortholog  dataset. 

Clustering  on  the  chromosome 

The  degree  to  which  orthologs  of  two  genes  are  clustered  on  the  chromosome  was 
calculated  based  on  the  null  hypothesis  that  genes  are  randomly  distributed  across  the 
chromosomes.  Instead  of  considering  gene  sizes  and  exact  nucleotide  positions,  we 
concentrated  on  gene  order  statistics.  The  association  strength  was  determined  as  the 
probability  of  chromosome  gene  order  position  of  a  candidate  gene  x  for  a  metabolic 
neighborhood  layer  A,;  P{x\N ,)  =  Y\WP{dg{x,y)\  where  G  is  a  set  of  query 

geG 

genomes  in  which  orthologs  of  both  x  and  y  can  be  found,  and  P[d^(x,y))  is  the 

probability  of  observing  gene  order  distance  (x,  y)  between  genes  x  and  y  in  a  genome 

g.  This  was  calculated  directly,  based  on  the  organism  chromosome  sizes  under  the  null 
hypothesis.  The  above  formulation  is  based  on  two  major  assumptions:  (1)  gene  order 
distances  to  different  genes  of  the  neighborhood  layer  A,  are  independent,  and  (2)  gene 

order  distances  between  a  specific  pair  of  genes  are  independent  across  different 
organisms. 

The  results  are  based  on  a  set  of  105  bacterial  and  three  eukaryotic  genomes  {S. 
cerevisiae,  S.  pombe,  C.  elegans)  from  Genbank.  The  set  was  screened  to  eliminate 
closely  related  species  using  ortholog  occurrence  mutual  information  threshold  of  0.9. 
Orthology  mapping  was  established  using  KEGG  SSDB  best  bi-directional  hits. 

Protein  interactions 

Interaction  likelihood  ratios  from  the  PIE  dataset  by  Jansen  et  al.  (Jansen  et  al.  2003) 
were  used  as  pair-wise  protein  interaction  association  values. 

Protein  fusions 
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Two  proteins  x  and  j  of  a  target  genome  {S.  cerevisiae  and  E.  coli)  were  taken  to  be 
associated  through  a  protein  fusion  event  if  both  of  the  following  conditions  were  met: 

\.)  X  and  y  are  homologous  to  the  same  protein  z  in  one  of  the  query  genomes  with  a 
BLASTp  E  value  below  a  specified  threshold  and  with  at  least  70%  of 

their  sequences  aligned  to  z. 

2.)  X  andy  align  to  different  regions  of  z,  or  to  regions  overlapped  by  no  more  than  10% 
of  the  shorter  protein  among  x  and  y.  If  x  or  y  align  to  multiple  regions  of  z,  then  any 
two  regions  must  not  overlap. 

A  set  of  70  query  genomes,  based  on  the  study  by  Bowers  et  al.  (Bowers  et  al.  2004),  was 
downloaded  from  the  Entrez  Genome  database: 
(http://www.ncbi.nlm.nih.gov/entrez/querv.fcgi?db=Genome). 

Several  values  of  were  used  in  generating  enzyme-encoding  gene  predictions 

(Eigures  C3  and  C4),  with  10‘^  and  10'^°  forE.  coir,  10'^  and 

lO’'**  for  A  cerevisiae. 


Adjacency-rank  score  rescaling,  metabolite  weighting  and  calculation  of  layer 
association  scores 

To  perform  adjacency-rank  score  rescaling  of  raw  pair-wise  association  values,  we 
calculate  the  adjacency  likelihood  ratio  for  a  pair  of  genes  x  and  y  as: 


alr{x,y)  = 


N, 


max 


}),  where  is  a  rank  of  gene  y  among  a  set  of 


raw  association  values  between  gene  x  and  all  other  genes  in  the  organism.  Lower  ranks 
correspond  to  higher  stringency  of  association.  The  probability  is  calculated  from  an 

empirical  distribution  of  all  ranks  ,  such  that  genes  a  and  b  are  adjacent  to  each  other 


(i.e.  directly  connected)  in  the  metabolic  network.  is  the  number  of  genes  in  an 

organism. 


Without  metabolite  weighting,  the  total  association  score  between  a  candidate  gene  x 
and  a  metabolic  neighborhood  layer  L  is  calculated  as:  s'core^(x)  =  ^exp[a/r(x,g)]. 


Metabolite  weighting  is  incorporated  by  calculating  total  association  score  as: 
score i^{x)  =  exp[a/r(x,g)],  where 

gei 


^g  = 


n 


1 


m,  is  the  metabolite  in  the 


mj  E©-"*  pairs 


shortest  path  0  connecting  neighborhood  gene  g  with  the  missing  enzyme.  is  the 

total  number  of  gene  pairs  connected  by  a  metabolite  m. .  If  more  than  one  metabolite 
connects  genes  along  the  path  0 ,  a  metabolite  with  the  smallest  N  .  is  used. 


Direct  likelihood-ratio  predictor  method 

The  placement  algorithm  considers  each  candidate  gene  by  evaluating  P{M  \  D) ,  which 
is  the  conditional  probability  that  a  given  candidate  encodes  the  desired  enzyme 
(model, M)  given  all  available  evidence  (data,/)).  Eollowing  Bayes  rule  we  can 
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calculate  that  probability  (up  to  a  constant)  using:  P{M  \  D)  oc 


P{D  I  M) 
P{D) 


where 


P{D  I  M)  is  the  probability  of  observing  existing  assoeiating  evidenee  for  a  true  enzyme- 
eneoding  gene.  Assuming  that  different  types  of  assoeiative  evidenee  scores  are 
independent,  we  ealculate  probabilities  as:  P{D)  =  Y[  Pe  (De )  ’  where  P^  {D^ ) 


eorresponds  to  the  posterior  of  evidence  type  e .  The  problem  is  therefore  transformed 
into  estimating  tails  of  association  score  probability  distributions  over  all  genes,  and 
enzyme-eneoding  genes.  For  different  types  of  associating  evidence  scores  these 
probabilities  were  evaluated  empirically  from  the  gene  counts,  assuming  that  the 
likelihood  of  association  increases  monotonieally  with  the  absolute  value  of  the  seore. 


The  self-rank  evaluations  of  known  E.  coli  and  S.  cerevisiae  metabolic  enzyme-encoding 
genes  (see  Self-rank  validation  seetion  in  5.3  Methods)  were  performed  using  a  leave- 
one-out  validation  strategy.  In  other  words,  in  eaeh  ease,  seores  of  the  candidate  being 
evaluated  are  not  included  when  ealculating  P{M  \  D) . 


Alternating  decision  tree  predictor 

The  mljava  implementation  of  the  AdaBoost  algorithm  (Freund  and  Mason  1999)  was 
used  to  build  ADT  classifiers  based  on  a  set  of  descriptors,  corresponding  to  different 
association  scores  with  individual  layers  of  the  metabolie  network  neighborhood.  The 
results  presented  in  Figures  C3  and  C4  are  based  on  10-fold  validation,  100  iterations  of 
boosting.  The  training  sets  included  data  on  only  60%  of  the  true  negative  (non- 
metabolic)  genes  in  order  to  minimize  computational  time.  The  candidate  genes  were 
prioritized  according  to  the  value  of  the  classifieation  margin. 

Predictions  with  combined  association  evidence 

The  self-rank  performanee  illustrated  in  Figures  C3  and  C4  was  ealeulated  based  on 
eandidate  assoeiation  with  first  three  layers  of  metabolie  network  neighborhood. 
Association  with  respect  to  each  layer  was  described  by  a  separate  association  score.  The 
predictions  were  performed  using  association  score  ranks:  given  a  eandidate  gene  x  for  a 
missing  enzyme  e ,  the  value  of  a  deseriptor  was  ealeulated  as  a  rank  of  score in  a 
set  of  scores  S  =  [score ^  (y)}y^c  ’  where  C  is  a  set  of  all  candidates  for  a  missing  enzyme 

e ,  with  higher  ranks  eorresponding  to  stronger  assoeiations.  For  the  E.  coli  metabolic 
model,  the  following  association  scores  were  used: 

Phylogenetic  profile  eo-occurrence,  ealeulated  with  extended  hypergeometrie  and 
folding  correetions,  on  orthologs  established  by  best  bi-direetional  homology 
relationship.  Separate  scores  were  calculated  using  BLAST-based  and  KEGG-based 
orthology  data. 

Chromosome  clustering. 

Gene  co-expression.  Separate  scores  were  ealeulated  for  E.  coli  expression  data,  and 
for  expression  of  E.  coli  orthologs  in  S.  cerevisiae  dataset. 

Protein  fusion.  Separate  scores  were  ealeulated  for  different  values  of  :  10'  , 

10'^  and  10''°. 
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Analogous  scores  were  used  for  predictions  on  S.  cerevisiae  metabolic  model,  with 
addition  of  a  protein  interaetion  score.  In  the  ease  of  protein  fusion,  the  seores  were 
ealeulated  for  the  following  values  of  :  10'^,  10'^  and  10'^°. 


5.4,  Conclusion 

The  results  presented  in  this  work  demonstrate  that  the  gene  eneoding  a  speeifie 
metabolie  function  can  be  effeetively  identified  from  eombined  funetional  assoeiation 
with  the  metabolic  network  neighborhood  of  the  desired  funetion.  This  indicates  that  the 
relationships  established  by  the  local  structure  of  the  metabolie  network  impose 
eonstraints  on  a  wide  range  of  natural  proeesses,  such  as  gene  expression  or  evolutionary 
processes  on  both  moleeular  and  genomic  scales.  Our  tests  used  a  eombination  of 
genome  context  and  expression  data  to  identify  known  E.  coli  metabolic  enzymes, 
predieting  them  within  the  top  10  (out  of  3351)  eandidates  in  60%  of  the  cases.  We  show 
that  in  the  case  of  both  E.  coli  and  S.  cerevisiae,  combining  multiple  types  of  assoeiation 
evidence  results  in  a  signifieantly  better  predietion  performanee  than  that  of  any 
individual  association  score. 

In  validating  the  performance  of  our  method,  we  relied  on  the  metabolic  network 
neighborhood  as  the  sole  source  of  information  about  the  desired  enzymatie  activity.  In 
practice,  additional  elues  regarding  aetivity  or  physieal  properties  of  the  unidentified 
enzyme  can  often  be  used  to  narrow  down  the  set  of  candidates.  These  additional  clues 
may  provide  restrictions  on  the  phylogenetic  profile  pattern,  protein  size,  presence  or 
absenee  of  membrane  spanning  regions  or  speeifie  protein  domains.  For  example,  for  E. 
coli  arabinose-5 -phosphate  isomerase,  yrbH  (Meredith  and  Woodard  2003)  is  predieted 
as  a  10**'  candidate  among  all  genes,  but  is  the  only  eandidate  within  the  top  50  with  a 
putative  sugar  isomerase  domain. 

Sequence  homology  to  known  proteins  remains  the  primary  method  of  identifying 
missing  enzymes  (Huynen  et  al.  2003;  Osterman  and  Overbeek  2003).  Predietions  based 
on  the  association  evidence  considered  in  this  work  are  complementary  to  homology- 
based  methods,  and  can  be  used  to  target  enzymes  that  have  not  been  identified  in  any 
organism  (referred  to  as  globally  miss'mg  enzymes  by  Osterman  et  al.).  Integration  of 
genome  eontext  information  into  the  refined  sequenee  homology  searches  has  been 
shown  to  improve  the  predietions  (Green  and  Karp  2004).  It  will  be  important  to  analyze 
how  ineorporation  of  diverse  association  evidence  presented  in  this  work  would  improve 
the  performanee,  in  particular  with  respect  to  the  difficult  cases  of  weak  or  ambiguous 
sequenee  homology.  The  overall  performance  of  the  presented  method  ean  be  improved 
in  a  number  of  ways.  The  datasets  underlying  individual  seores  can  be  expanded. 
Genome  divergenee  eorreetions  for  the  ehromosome  clustering  seore  are  also  likely  to 
improve  the  results.  Further  extensions  can  provide  better  identification  in  the  eases 
where  multiple  missing  genes  appear  within  the  same  metabolie  neighborhood.  We  hope 
that  the 

presented  method,  and  its  future  derivations,  will  be  important  in  completing  metabolie 
models  of  different  organisms. 
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Table  Cl:  Association  scores  used  in  self-rank  tests  on  combined  evidence 


Table  C 1 :  Association  scores  used  in  self-rank  tests  on  combined  evidence 


Evidence  type  \  Oraanisin 

Ecoli 

S. cerevisiae 

Phylogenetic  profile  co¬ 
occurrence. 

•  BLAST-based  dataset  score 

•  KEGG-based  dataset  score 

Pairwise  associations  were  calculated  using  extended 
hypergeometric  and  folding  corrections,  on  orthologs 
establislied  bv  best  bi-directional  homoloav  relationshiiv 

Clustering  of  genes  on  tlie 
chromosome 

•  Gene  clustering  scores.  Paiixvise  associations  were 
calculated  on  108  genomes,  with  KEGG-based 
ortholoL'\  dataset 

Gene  co-expression 

•  £.  coli  SMD  expression 
dataset  score 

•  Expression  of  E.  coli 
orthologs  in  S. 
cerevisiae  Rosetta 
dataset. 

•  S.  cerevisiae  Rosetta 
expression  dataset  score 

•  Expression  of  S. 
cere\'isiae  orthologs  in 
£  coli  SMD  dataset. 

Protein  fusion 

Septtrate  scores  were 
calculated  for  different 

values  of 

•  10'- 

•  lO'-'* 

•  lo-'® 

Separate  scores  were 
calculated  for  different 

values  of 

•  10-^ 

•  10-’ 

•  10’“’ 

Protein  interactions 

•  interaction  score  based 
on  PIE  dataset 
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Figure  Cl.a.  Illustration  of  the  missing  gene  problem. 

Figure  Cl.b.  Illustration  of  the  Self-rank  validation  test 

Metabolic  network  neighborhood  of  a  missing  metabolic  enzyme  is  shown.  The 
neighborhood  comprises  layers  with  increasing  radii  (indicated  by  shading).  Majority  of 
the  enzyme-encoding  genes  in  the  neighborhood  are  known,  b.  Illustration  of  the  self¬ 
rank  validation  test.  Ability  to  predict  known  enzyme-encoding  genes  is  tested  by 
measuring  the  rank  of  a  true  enzyme-encoding  gene  in  the  candidate  set.  The  candidates 
are  ordered  according  to  overall  strength  of  functional  association  with  the  metabolic 
network  neighborhood  of  the  enzyme.  The  set  contains  all  genes  that  are  not  already  part 
of  the  metabolic  network. 


a. 


b. 


self-rank  threshold 


Figure  C2a.  Performance  of  different  phylogenetic  profile  datasets  and  corrections.. 
Figure  C2b.  The  self-rank  performance  of  the  1*‘  layer  phylogenetic  profile  score 
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a,  A  cumulative  self-rank  distribution  is  shown  for  E.  coli  enzymes,  as  predicted  based  on 
the  phylogenetic  profile  associations  with  the  l'^*  layer  of  the  metabolic  network 
neighborhood.  Performance  of  a  regular  hypergeometric  distribution  is  shown  (HG), 
together  with  extended  hypergeometric  (xHG)  and  folding  (xHG+folding)  corrections. 
The  scores  are  calculated  on  the  BLAST-based  dataset,  b.  The  self-rank  performance  of 
the  l’^'  layer  phylogenetic  profile  score,  calculated  using  extended  hypergeometric 
distribution  with  folding  is  shown  for  BLAST-based,  KEGG-based  and  COG  orthology 
datasets.  The  performance  of  the  COG  orthology  dataset  is  corrected  for  the  metabolic 
gene  coverage  bias. 


self-rank  threshold 


Figure  C3.  Comparison  of  ADT  and  DLR  methods  for  combining  multiple  association  evidence 

types. 

Cumulative  self-rank  distribution  is  shown  for  E.  coli  and  S.  cerevisiae  metabolic 
enzymes  based  on  the  combined  association  evidence  (see  Methods).  The  performance  is 
compared  for  DLR  (dashed  curves)  and  ADT  (solid  curves)  methods. 
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3.  E.coli 


combined  • — •  chromosome  clustering  «  •  phylogenetic  profiles  • — •  co-expression  .  protein  fusion 

• — •  protein  Interactions  - random  expectation 


Figure  C4.  Enzyme  predictions  based  on  individual  and  combined  types  of  association  evidence. 

Cumulative  self-rank  distribution  is  shown  for  the  metabolic  enzymes  of  a,  E.  coli 
metabolism  and  b,  S.  cerevisiae  metabolism.  Predictions  are  generated  based  on 
association  with  the  first  three  layers  of  the  metabolic  network  neighborhood,  using  ADT 
classifier  with  10-foId  validation. 
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6  Conclusion 


The  ideas  in  our  2001  DARPA  grant  proposal  about  design  of  complex  synthetic 
biological  polymers  and  software  to  aid  this  have  coalesced  into  a  vibrant  field  (e.g.  the 
recent  Dec  2005  issue  of  Nature  has  a  collection  of  articles  related  to  the  new  discipline 
of  Synthetic  Biology).  Initial  designs  focused  on  biopolymer  synthesis  (DNA,  RNA, 
protein),  in  vitro.  That  project  was  expanded  with  supplementary  funding  to  include 
experimental  work  which  was  then  transitioned  out  to  a  DOE  GtL  Center  grant  which 
resulted  in  a  Nature  paper  and  commercial  licensing  (to  CodonDevices).  The  metabolic 
modeling  (MOMA)  has  developed  as  very  useful  in  stand-alone  software  and  as  an 
example  of  the  successes  and  challenges  on  merging  such  software  into  the 
BioSPICE/BioCOMP  community  vision.  The  metabolic  modeling  has  also  been  very 
successful  and  lives  on  in  our  Harvard/MIT  DOE  GTE  Center.  The  BioSpice  component 
led  by  Sri  Kumar  and  others  in  our  DARPA  BioCOMP  grant  program  aimed  to  develop 
interoperability  and  constituted  a  longer-term  community-building  exercise. 

Technology  Transition: 

How  the  impact  of  this  work  is  measured;  Eiterature  citations  and  milestones  of  licensees 
set  by  Harvard  Medical  School  Office  of  Technical  Eicensing  (HMS  OTL)  (Maryanne 
Eenerj  ian  <maryanne_fenerj  ian@hms  .harvard.  edu>) 

Prototype  available  for  dissemination;  In  situ  fluorescent  base  extension.  Purpose; 

Identity  and  quantitation  based  on  single  DNA  molecules.  Environment  requirements; 
Research  laboratory.  Point  of  contact  /  email  address;  Jay  Shendure 
<jay_shendure@student.hms.harvard.edu>  http;//arep.med.harvard.edu/Polonator/ 

Systems  available  for  dissemination: 

MapQuant.  Purpose;  New  Open-Source  Software  for  Earge-Scale  Protein  Quantitation 
Environment  requirements;  Research  computers  supporting  Einux  or  Windows.  Point  of 
contact  /  email  address;  Kyriacos  Eeptos  <leptos@fas. harvard. edu> 
http  ;//club  .med.harvard.  edu/MapQuant/ 

Minimization  of  Metabolic  Adjustments  (MOMA).  Purpose;  Optimization  of  metabolic 
network  utilization  in  engineered  (or  mutant)  genomes.  Environment  requirements; 
Research  computers  supporting  Perl  &  C.  Point  of  contact  /  email  address;  Daniel  Segre 
<dsegre@genetics.med.harvard.edu> 
http;//arep.med.harvard.edu/moma/ 
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